Abstract

The problem of publishing personal data without giving up privacy is becoming increasingly important. A clean formalization that has been recently proposed is $k$-anonymity, where the rows of a table are partitioned into clusters of size at least $k$ and all rows in a cluster become the same tuple, after the suppression of some entries. The natural optimization problem, where the goal is to minimize the number of suppressed entries, is hard even when the stored values are over a binary alphabet, as well as when the table consists of a bounded number of columns. In this paper we study how the complexity of the problem is influenced by different parameters. First we show that the problem is W[1]-hard when parameterized by the value of the solution (and $k$). Then we exhibit a fixed-parameter algorithm when the problem is parameterized by the number of columns and the maximum number of different values in any column. Finally, we prove that $k$-anonymity is still APX-hard even when restricting to instances with 3 columns and $k = 3$.
1 Introduction

In epidemic studies the analysis of large amounts of personal data is essential. At the same time the dissemination of the results of those studies, even in a compact and summarized form, can provide some information that can be exploited to identify the row pertaining to a certain individual. For instance, ZIP code, gender and date of birth can uniquely identify 87% of individuals in the U.S. [18]. Therefore when managing personal data it is of the utmost importance to effectively protect individuals' privacy.

One approach to deal with this problem is the $k$-anonymity model [HI [181 113 02]. Each row of a given table represents all data regarding a certain individual. Then different rows are clustered together, and some entries of the rows in each cluster are suppressed (i.e. they are replaced with a *) so that each cluster consists of at least $k$ identical rows. Therefore each row $r$ in the resulting table is clustered with at least $k - 1$ other rows identical to $r$, hence the resulting data do not allow the identification of any individual. While this formulation is not very sophisticated and has some practical limitations, it is definitely interesting from a theoretical point of view, as witnessed by the rich literature available. We will focus on separating the cases that can be solved efficiently from those that are intractable, therefore hinting at which strategies are likely, or unlikely, to be successfully employed when studying more sophisticated formalizations. Notice that different formulations of the problem have also been proposed [1], for example allowing the generalization of entry values, that is an entry value can be replaced with a less specific value [3], or considering a notion of proximity among values [10].

A parsimony principle leads to the optimization problem where we want to minimize the number of entries in the table to be suppressed. The $k$-anonymity problem is known to be APX-hard even when the matrix entries are over a binary alphabet and $k = 3$ [6], as well as when the matrix has 8 columns and $k = 4$ (this time on arbitrary alphabets) [6]. Furthermore, a polynomial-time $O(k)$-approximation algorithm on arbitrary input alphabet, as well as approximation algorithms for restricted cases, are known [2]. Recently, two polynomial-time approximation algorithms with factor $O(\log k)$ have been independently proposed [14, 11].

In this paper we investigate the parameterized complexity [U [13] of the problem, unveiling how different parameters are involved in the complexity of the problem. A first systematic study of the parameterized complexity of the $k$-anonymity problem has been proposed in [7]. Here, we follow the same direction, showing that the problem is W[1]-hard when parameterized by the size of the solution and $k$, and we provide a fixed-parameter algorithm when the problem is parameterized by the number of columns and the maximum number of different values in any column. These problems were left open in [7].

In Table 1 we report the status of the parameterized complexity of the $k$-anonymity problem, where in bold we have emphasized the new results presented in this paper. We recall that a problem $P$ parameterized by a set $Y$ of parameters is in the class FPT [8] if it admits an exact algorithm with complexity $f(Y) n^{O(1)}$, where $f$ is an arbitrary function and $n$ is the size of the input, while it is W[i]-hard [8], for some $1 \le i \le p$, if it is unlikely to be fixed-parameter tractable. We recall that XP [8] is a superclass of all classes W[p]. Moreover, proving that a problem $\Pi$ with parameter set $S$ is NP-hard when all parameters in $S$ are constants implies that $(\Pi, S) \notin$ XP unless P = NP.

Table 1: Summary of the parameterized complexity status of the $k$-anonymity problem; $|\Sigma|$ represents the maximum number of different values in a column, $m$ represents the number of columns, $n$ represents the number of rows, $k$ represents the minimum size of a cluster, $e$ represents the size of the solution.

The rest of the paper is organized as follows. In Section 2 we introduce some preliminary definitions and we give the formal definition of the $k$-anonymity problem. In Section 3 we show that $k$-anonymity is W[1]-hard. In Section 4 we give a fixed-parameter algorithm when the problem is parameterized by the size of the alphabet and the number of columns. Finally, in Section 5 we show that the 3-anonymity problem is APX-hard, even when the rows have length bounded by 3.

2 Preliminary Definitions 

Let us introduce some preliminary definitions that will be used in the rest of the paper. Given a graph $G = (V,E)$ and $V' \subseteq V$, the subgraph induced by $V'$ is denoted by $G[V'] = (V',E')$, where $E' = E \cap (V' \times V')$. A graph $G = (V,E)$ is cubic when each vertex in $V$ has degree three.

Given an alphabet $\Sigma$, a row $r$ is a vector of elements taken from the set $\Sigma$, and the $j$-th element of $r$ is denoted by $r[j]$. Notice that it is equivalent to consider a row as a vector over $\Sigma$ or as a string over alphabet $\Sigma$. Let $r_1, r_2$ be two equal-length rows. Then $H(r_1, r_2)$ is the Hamming distance of $r_1$ and $r_2$, i.e. $|\{i : r_1[i] \neq r_2[i]\}|$. Let $R$ be a set of $l$ rows, then a clustering of $R$ is a partition $\Pi = (P_1, \dots, P_t)$ of $R$. Given a clustering $\Pi = (P_1, \dots, P_t)$ of $R$, we define the cost of the row $r$ belonging to a set $P_i$ of $\Pi$ as $c_\Pi(r) = |\{j : \exists r_1, r_2 \in P_i,\ r_1[j] \neq r_2[j]\}|$, that is the number of entries of $r$ that have to be suppressed so that all rows in $P_i$ are identical. Similarly we define the cost of a set $P_i$, denoted by $c_\Pi(P_i)$, as $|P_i| \cdot |\{j : \exists r_1, r_2 \in P_i,\ r_1[j] \neq r_2[j]\}|$. The cost of $\Pi$, denoted by $c(\Pi)$, is defined as $\sum_{P_i \in \Pi} c_\Pi(P_i)$. Given a set $S \subseteq R$ and a clustering $\Pi$ of $R$, the cost induced by $\Pi$ in set $S$ is $c_\Pi(S) = \sum_{r \in S} c_\Pi(r)$. Notice that, given a clustering $\Pi = (P_1, \dots, P_t)$ of $R$, the quantity $|P_i| \max_{r_1, r_2 \in P_i} \{H(r_1, r_2)\}$ is a lower bound for $c_\Pi(P_i)$, since all the positions for which $r_1$ and $r_2$ differ will be deleted in each row of $P_i$. We are now able to formally define the $k$-Anonymity Problem ($k$-AP).
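The cost definitions above can be checked on small inputs with a few lines of Python. This is only an illustrative sketch: rows are modelled as equal-length strings, and the function names (`hamming`, `cluster_cost`, `clustering_cost`, `lower_bound`) are hypothetical helpers, not notation from the paper.

```python
from itertools import combinations

def hamming(r1, r2):
    """H(r1, r2): number of positions where two equal-length rows differ."""
    assert len(r1) == len(r2)
    return sum(a != b for a, b in zip(r1, r2))

def cluster_cost(cluster):
    """Cost of one set P: |P| times the number of columns where two rows of P differ."""
    m = len(cluster[0])
    bad_columns = sum(1 for j in range(m) if len({r[j] for r in cluster}) > 1)
    return len(cluster) * bad_columns

def clustering_cost(clustering):
    """c(Pi): sum of the costs of all sets of the clustering."""
    return sum(cluster_cost(p) for p in clustering)

def lower_bound(cluster):
    """|P| times the largest pairwise Hamming distance inside P."""
    diam = max((hamming(a, b) for a, b in combinations(cluster, 2)), default=0)
    return len(cluster) * diam

# A clustering with k = 2 over rows of length 3.
clustering = [["aaa", "aaa", "aaa"], ["aaa", "aba"], ["bbb", "bbc"]]
print(clustering_cost(clustering))  # 0 + 2 + 2 = 4
```

The bound can be strict: for the set {aaa, baa, aba, aab} every pair differs in at most two positions, yet all three columns must be suppressed.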

Problem 1. $k$-AP.

Input: a set $R$ of equal-length rows over an alphabet $\Sigma$.

Output: a clustering $\Pi = (P_1, \dots, P_t)$ of $R$ such that $|P_i| \ge k$ for each set $P_i$ and $c(\Pi)$ is minimum.

In what follows, given a set $S$ of parameters, we denote by $(S)$-AP the $k$-AP problem parameterized by $S$, thus omitting $k$. We will consider the following parameters: $m$ is the number of columns of the rows in $R$; $n$ is the number of rows in $R$; $|\Sigma|$ is the maximum number of different values in any column of the table; $k$ is the minimum size of a cluster; $e$ is the maximum number of entries that can be suppressed.

Let $\Pi = (P_1, \dots, P_z)$ be a solution of the $k$-AP problem. Notice that a suppression at position $j$ of a row $r$ is represented by replacing the symbol $r[j]$ with a *. Given a set $P_i$ of $\Pi$, some entries of the rows clustered in $P_i$ are suppressed, so that the resulting rows are all identical to a vector $r$ over alphabet $\Sigma \cup \{*\}$; such a vector is the resolution vector associated with $P_i$. Given a resolution vector $r$, we define $del(r)$ as the number of entries suppressed in $r$, that is $del(r) = |\{j : r[j] = *\}|$. Given a resolution vector $r$ and a row $r_i \in R$, we say that $r$ is compatible with row $r_i$ iff $r[j] \neq r_i[j]$ implies $r[j] = *$. Given a row $r_i$ of $R$ and a set of resolution vectors $S'$, we define the set $comp(r_i, S') = \{r \in S' : r \text{ is compatible with } r_i\}$.
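A minimal sketch of resolution vectors, $del$ and $comp$, again modelling rows as strings. The helper names and the choice of `"*"` as a one-character suppression symbol are assumptions made for illustration only.

```python
STAR = "*"

def deletions(vec):
    """del(r): number of suppressed entries of a resolution vector."""
    return sum(1 for c in vec if c == STAR)

def is_compatible(vec, row):
    """vec is compatible with row iff every mismatch is a suppressed entry of vec."""
    return len(vec) == len(row) and all(v == c or v == STAR for v, c in zip(vec, row))

def comp(row, vectors):
    """comp(row, S'): the vectors of S' that are compatible with row."""
    return [v for v in vectors if is_compatible(v, row)]

def resolution_vector(cluster):
    """Keep a column only if all rows of the cluster agree on it, else suppress it."""
    return "".join(col[0] if len(set(col)) == 1 else STAR for col in zip(*cluster))

S_prime = ["aaa", "a*a", "bb*"]
print(resolution_vector(["aaa", "aba"]))  # a*a
print(comp("aba", S_prime))               # ['a*a']
print(comp("aaa", S_prime))               # ['aaa', 'a*a']
```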

Given a set $P$ of rows, we define a group of rows of $P$ as a maximal set of identical rows. Given a group $g$, the representative row of $g$, denoted by $r(g)$, is any row of $g$, while $s(g)$ is the number of rows in $g$ and $exc(g) = \max\{0, s(g) - k\}$. A set $R$ of rows can be partitioned into groups of identical rows in polynomial time [7], therefore we can compute in polynomial time whether a set $R$ of rows is $k$-anonymous, i.e. $R$ can be partitioned into groups of size at least $k$. If this is not possible, then observe that at least $k$ entries of $R$ must be suppressed to get a solution of the $k$-AP problem, that is $e \ge k$. Hence $(e)$-AP is in FPT iff $(e,k)$-AP is in FPT. Consequently our parameterized reduction [HJ [13] will show the fixed-parameter intractability of $(e)$-AP and $(e,k)$-AP.
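Grouping identical rows and testing $k$-anonymity amounts to counting duplicates. The sketch below uses `collections.Counter`; the function names are hypothetical and the rows are again plain strings.

```python
from collections import Counter

def groups(rows):
    """Map each distinct row (the representative r(g)) to s(g), its group size."""
    return Counter(rows)

def exc(size, k):
    """exc(g) = max{0, s(g) - k}: rows of a group beyond the k needed for a cluster."""
    return max(0, size - k)

def is_k_anonymous(rows, k):
    """True iff every group of identical rows has size at least k."""
    return all(size >= k for size in groups(rows).values())

rows = ["aaa", "aaa", "aaa", "aaa", "aba", "bbb", "bbc"]
print(groups(rows)["aaa"])       # 4
print(exc(groups(rows)["aaa"], 2))  # 2
print(is_k_anonymous(rows, 2))   # False
```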

3 (e)-AP and (e,k)-AP are W[1]-hard

We show that $(e)$-AP and $(e,k)$-AP are W[1]-hard. Given a set $R$ of equal-length rows, $(e)$-AP and $(e,k)$-AP ask if there exists a clustering $\Pi = (P_1, \dots, P_t)$ of $R$ such that $|P_i| \ge k$ for each set $P_i$, and $c(\Pi) \le e$. We present a parameter-preserving reduction from the $h$-Clique problem, which is known to be W[1]-hard [9], to the $(e)$-AP problem. Given a graph $G = (V,E)$, an $h$-clique is a set $V' \subseteq V$ where each pair of vertices in $V'$ is connected by an edge of $G$, and $|V'| = h$. The $h$-Clique problem asks for a subset $V'$ of the vertices of a given graph $G$ inducing an $h$-clique in $G$.

Clearly the vertices of an $h$-clique are connected by $\binom{h}{2}$ edges. Given a graph $G = (V,E)$, we use $m_G$ and $n_G$ to denote respectively the number of edges and of vertices of $G$. We construct the instance $R$ of $(e)$-AP associated with $G$. First, let us define $k = 2h^2$. The set $R$ consists of $(k+1) m_G + (k - \binom{h}{2})$ rows and $2h + n_G$ columns over alphabet $\Sigma = \{0, 1\} \cup \{a_{i,j} : (v_i, v_j) \in E\}$. More precisely, for each edge $e(i,j) = (v_i, v_j)$ in $E$, there is a group $R(i,j)$ of $k + 1$ identical rows $r_x(i,j)$, $1 \le x \le k + 1$, where

• $r_x(i,j)[l] = a_{i,j}$, for $1 \le l \le 2h$;

• $r_x(i,j)[2h + i] = 1$, $r_x(i,j)[2h + j] = 1$;

• $r_x(i,j)[2h + l] = 0$, for $l \neq i, j$ and $1 \le l \le n_G$.

Moreover, $R$ also contains a group $R_0$ made of $k - \binom{h}{2}$ identical rows equal to $0^{2h + n_G}$.
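The construction can be sketched in Python as follows. The representation is an assumption made for illustration: vertices are numbered $1, \dots, n_G$, the symbol $a_{i,j}$ is encoded as the tuple `("a", i, j)`, and the hypothetical function `build_instance` returns $k$ together with the list of rows.

```python
from math import comb

def build_instance(n_vertices, edges, h):
    """Rows of the (e)-AP instance associated with a graph on vertices 1..n_vertices."""
    k = 2 * h * h
    width = 2 * h + n_vertices
    rows = []
    for (i, j) in edges:
        # First 2h columns hold the edge symbol a_{i,j}; the rest is the
        # incidence vector of the edge (1 in columns 2h+i and 2h+j).
        row = [("a", i, j)] * (2 * h) + [0] * n_vertices
        row[2 * h + i - 1] = 1   # paper columns are 1-based
        row[2 * h + j - 1] = 1
        rows.extend([tuple(row)] * (k + 1))      # group R(i, j)
    zero = tuple([0] * width)
    rows.extend([zero] * (k - comb(h, 2)))       # group R_0
    return k, rows

# Triangle on vertices 1, 2, 3 with h = 3.
k, rows = build_instance(3, [(1, 2), (1, 3), (2, 3)], 3)
print(k, len(rows), len(rows[0]))  # 18 72 9
```

In this example the row count agrees with the formula $(k+1) m_G + (k - \binom{h}{2}) = 19 \cdot 3 + 15 = 72$.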

Lemma 1. Let $R$ be the instance of $(e)$-AP associated with $G$ and consider two rows $r$, $r_x(i,j)$ of $R$, such that $r \in R_0$ and $r_x(i,j) \in R(i,j)$. Then $r[t] \neq r_x(i,j)[t]$, for each $1 \le t \le 2h$.

Lemma 2. Let $G = (V,E)$ be a graph, let $V'$ be an $h$-clique of $G$ and let $R$ be the instance of $(e)$-AP associated with $G$. Then we can compute in polynomial time a solution $\Pi$ of $(e)$-AP over instance $R$ with cost at most $6h^3$.

Lemma 3. Let $G = (V,E)$ be an instance of $h$-Clique, let $R$ be the instance of $(e)$-AP associated with $G$ and let $\Pi$ be a solution of $(e)$-AP over instance $R$ with cost at most $6h^3$. Then we can compute in polynomial time an $h$-clique $V'$ of $G$.

Proof. First we will prove that $\Pi$ must have a set $R'_0 \supseteq R_0$. Assume to the contrary that in $\Pi$ there are two sets $A$, $B$ each containing at least a row of $R_0$. Notice that $|R_0| < k$ while $|A|, |B| \ge k$. Moreover, by Lemma 1 all rows in $A$ or $B$ must have the first $2h$ entries suppressed, which results in at least $4hk > 6h^3$ suppressions, contradicting the assumption on the cost of the solution. Hence, $R_0$ is properly contained in a set $R'_0$ of $\Pi$, as $|R_0| < k$. Moreover, let $r'$ be a row of $R'_0 \setminus R_0$ and let $r$ be a row of $R_0$. By Lemma 1, $r'[t] \neq r[t]$ for each column $t$, $1 \le t \le 2h$, therefore all entries in the first $2h$ columns of each row in $R'_0$ must be suppressed.

Now, let us prove that, for each set $R(i,j)$ of $R$, there exists a set $R'(i,j)$ of $\Pi$ such that $R'(i,j) \subseteq R(i,j)$. Assume to the contrary that no such set $R'(i,j)$ exists, for a given $R(i,j)$. Then either $R(i,j) \subseteq R'_0$ or there exists a row of $R(i,j)$ clustered together with a row of $R(x,y)$ in $\Pi$, with $(x,y) \neq (i,j)$. In the first case, that is $R(i,j) \subseteq R'_0$, we have $|R'_0| \ge 2k + 1 - \binom{h}{2}$; by construction all entries of the first $2h$ columns of the rows in $R'_0$ must be suppressed, resulting in at least $2h(4h^2 - \binom{h}{2}) > 6h^3$ suppressions and thus contradicting the assumption on the cost of the solution. Consider now the second case, that is there is a set $A$ in $\Pi$ containing at least a row of two different sets $R(i,j)$ and $R(x,y)$ of $R$. Observe that given $r' \in R'_0 \setminus R_0$ and $r \in R_0$, $r$ and $r'$ differ in the first $2h$ columns. Thus the entries of the first $2h$ columns of the rows of $R'_0$ must be suppressed, resulting in at least $4hk > 6h^3$ suppressed entries and thus contradicting the assumption on the cost of the solution. Hence, for each set $R(i,j)$ of $R$, there exists a set $R'(i,j)$ of $\Pi$ such that $R'(i,j) \subseteq R(i,j)$.

By our previous arguments we can assume that $\Pi$ consists of the clusters $R'_0$ and $R'(i,j)$, for each $R(i,j) \subseteq R$, and that $|R(i,j)| - 1 \le |R'(i,j)| \le |R(i,j)|$. Notice that only $R'_0$ can contain some suppressed entries. Also $|R'_0| = k$, for otherwise we can improve the cost of $\Pi$ by moving a row in $R(i,j) \cap R'_0$ from $R'_0$ to $R'(i,j)$. Now let $E'$ be the set of edges $(v_i, v_j)$ of $G$ such that a row of $R(i,j)$ is in $R'_0$ and let $V'$ be the set of vertices incident on at least an edge in $E'$. Then we can show that $G[V']$ is an $h$-clique. Notice that the entries in the first $2h$ columns of $R'_0$ must be suppressed, as well as all columns with index $2h + l$ such that $v_l \in V'$, since in those columns all rows in $R_0$ have value 0 while some rows in $R'_0 \setminus R_0$ have value 1. An immediate consequence is that the overall number of suppressed entries is at least $2hk + k|V'|$. Since, by hypothesis, the number of suppressed entries is at most $6h^3 = 3kh$, then $|V'| \le h$. Notice that, since $|R_0| = k - \binom{h}{2}$ and $|R'_0| = k$, then $R'_0 \setminus R_0$ contains exactly $\binom{h}{2}$ distinct rows corresponding to edges in $E'$ incident on vertices of $V'$. Hence $V'$ induces an $h$-clique in $G$. □
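The counting step that closes the proof can be written out explicitly. The following is only a restatement of the inequalities in the proof with $k = 2h^2$, plus one standard fact that the text leaves implicit: a simple graph on $|V'|$ vertices has at most $\binom{|V'|}{2}$ edges. It assumes the `amsmath` package.

```latex
\begin{align*}
2hk + k|V'| \le 6h^3 = 3hk
  &\;\Longrightarrow\; |V'| \le h, \\
|E'| = |R'_0 \setminus R_0| = k - \Bigl(k - \tbinom{h}{2}\Bigr) = \tbinom{h}{2}
  \le \tbinom{|V'|}{2}
  &\;\Longrightarrow\; |V'| \ge h .
\end{align*}
```

Together the two bounds give $|V'| = h$ and $|E'| = \binom{|V'|}{2}$, so every pair of vertices of $V'$ is joined by an edge of $E'$.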

From Lemmas 2 and 3 our reduction is parameter preserving, therefore $(e)$-AP and $(e,k)$-AP are W[1]-hard.

4 An FPT algorithm for $(|\Sigma|,m)$-AP

In this section we present a fixed-parameter algorithm for the $(|\Sigma|,m)$-AP problem, that is the AP problem where the number $m$ of columns and the maximum number $|\Sigma|$ of different values in any column are the two parameters. Notice that $k$-AP parameterized by exactly one of $|\Sigma|$ or $m$ is not in FPT, as $k$-AP is APX-hard (hence NP-hard) even when one of $|\Sigma|$ or $m$ is a constant [6].

Before giving the details of the algorithm, let us first introduce some preliminary definitions. Let $R$ be an instance of $(|\Sigma|,m)$-AP, and for each column of $R$ with index $j$, $1 \le j \le m$, let $\Sigma_j$ be the set of different values that the rows of $R$ have in column $j$. Notice that $|\Sigma_j| \le |\Sigma|$ for each $1 \le j \le m$. Let $\Sigma_j^* = \Sigma_j \cup \{*\}$ and $\Sigma^* = \Sigma \cup \{*\}$. Assume $\Pi = \{P_1, \dots, P_z\}$ is a feasible solution of $(|\Sigma|,m)$-AP over instance $R$. The set $S'$ consisting of a resolution vector for each set $P_i \in \Pi$ is called a candidate set for the solution of $(|\Sigma|,m)$-AP. Let $S$ be the set of possible rows of length $m$ having a value over alphabet $\Sigma_j^*$ in each position $1 \le j \le m$; then $|S|$ is bounded by $|\Sigma^*|^m$. Given a candidate set $S'$, notice that $S' \subseteq S$ and that each row $r \in R$ must be compatible with at least one resolution vector in $S'$.
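The set $S$ of all possible resolution vectors can be enumerated column by column, which also makes the bound $|S| \le |\Sigma^*|^m$ concrete. This is a sketch under the same string representation used for illustration; `candidate_vectors` is a hypothetical helper name.

```python
from itertools import product

def candidate_vectors(rows):
    """All vectors whose j-th symbol is a value seen in column j, or '*'."""
    m = len(rows[0])
    column_alphabets = [sorted({r[j] for r in rows}) + ["*"] for j in range(m)]
    return ["".join(v) for v in product(*column_alphabets)]

rows = ["aaa", "aaa", "aaa", "aaa", "aba", "bbb", "bbc"]
S = candidate_vectors(rows)
# Columns have 2, 2 and 3 distinct values, so |S| = 3 * 3 * 4 = 36,
# below the bound (|Sigma| + 1)^m = 4^3 = 64.
print(len(S))  # 36
```

The algorithm of this section then ranges over subsets $S'$ of $S$, which is where the doubly exponential dependence on the parameters comes from.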

Given a row $r$ and the set $S'$ of resolution vectors, recall that we denote by $comp(r, S')$ the set of resolution vectors of $S'$ compatible with $r$. Moreover, given a resolution vector $r' \in S'$, we denote by $del(r')$ the number of suppressions in $r'$. For each row $r \in R$ we define its weight as $w(r) = \max_{r_x \in comp(r,S')} \{m - del(r_x)\}$. Notice that $w(r) = m$ whenever $r$ is compatible with a row without suppressions. Informally, the weight of a row is equal to the maximum number of its entries that might be preserved in a solution where $S'$ is the set of resolution vectors. Finally, we define $W = \sum_{r \in R} w(r)$ and $w'(r_x) = W + m - del(r_x) + 1$ for each row $r_x \in S'$. Notice that $w'(r_x) > \sum_{r \in R} w(r)$ for each $r_x \in S'$. The weights defined above will be used later in Section 4.1 to define the weight function $w_h$.

Algorithm 1: Solving $(|\Sigma|,m)$-AP

Input: An instance $R$ of $(|\Sigma|,m)$-AP made of a set of $n$ rows, each one consisting of $m$ symbols, and an integer $e$
Output: a solution of $(|\Sigma|,m)$-AP over instance $R$, if $(|\Sigma|,m)$-AP admits a solution that suppresses at most $e$ entries

$S \leftarrow$ the set of resolution vectors of length $m$, where each $j$-th symbol, $1 \le j \le m$, is taken from the alphabet $\Sigma_j^*$;
foreach subset $S'$ of $S$ do
    $G_{R,S'} \leftarrow$ the graph associated with $R$, $S'$;
    $M \leftarrow$ a maximum weight matching of $G_{R,S'}$; $w \leftarrow$ the weight of $M$;
    if $M$ is feasible and $w \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$ then
        return the solution $\Pi_{S'}(M)$ of $R$ associated with $M$;
return No such solution exists
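The weights $w(r)$, $W$ and $w'(r_x)$ defined above can be computed directly from their definitions. The sketch below assumes rows and resolution vectors are strings with `"*"` marking a suppression; `row_weights` and `is_compatible` are hypothetical helper names, and the function rejects a set $S'$ that leaves some row without a compatible vector.

```python
def is_compatible(vec, row):
    """Every position where vec and row differ must be suppressed in vec."""
    return all(v == c or v == "*" for v, c in zip(vec, row))

def row_weights(rows, S_prime):
    """Return (w, W, w_prime) for the rows R and a candidate set S'."""
    m = len(rows[0])
    w = {}
    for r in set(rows):
        compatible = [x for x in S_prime if is_compatible(x, r)]
        if not compatible:
            raise ValueError(f"no resolution vector is compatible with row {r!r}")
        # w(r): the most entries of r that any compatible vector preserves.
        w[r] = max(m - x.count("*") for x in compatible)
    W = sum(w[r] for r in rows)   # rows counted with multiplicity
    w_prime = {x: W + m - x.count("*") + 1 for x in S_prime}
    return w, W, w_prime

rows = ["aaa", "aaa", "aaa", "aaa", "aba", "bbb", "bbc"]
w, W, w_prime = row_weights(rows, ["aaa", "a*a", "bb*"])
print(w["aaa"], w["aba"], w["bbb"], w["bbc"])  # 3 2 2 2
print(W)                                       # 18
```

Since $m - del(r_x) + 1 \ge 1$, each $w'(r_x)$ exceeds $W$, which is the property the matching argument relies on.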

Let us first describe the general idea of the algorithm. Given a candidate set $S' \subseteq S$, the algorithm computes an optimal solution $\Pi_{S'}$ associated with $S'$ (see Algorithm 1). The algorithm consists of two main phases. In the first phase (Section 4.1), given the set $R$ of input rows and the candidate set $S'$, the algorithm builds a weighted bipartite graph $G_{R,S'}$ associated with $R$ and $S'$. In the second phase (Section 4.2) a solution of $(|\Sigma|,m)$-AP is computed starting from a maximum weight matching of the graph $G_{R,S'}$. Section 4.3 is devoted to proving that the solution computed by the algorithm is optimal.



4.1 Building the graph $G_{R,S'}$

Let us consider a candidate set $S'$ of vectors for an optimal solution of $(|\Sigma|,m)$-AP. Since $S' \subseteq S$, there exist at most $2^{|\Sigma^*|^m}$ possible candidate sets $S'$, therefore our FPT algorithm enumerates each candidate set $S'$ and verifies if there exists a solution $\Pi_{S'}$ with cost at most $e$. In order to verify if such a solution exists, the algorithm builds a bipartite graph $G_{R,S'}$, as described in this section. The intuitive idea behind the graph is that edges of the graph correspond to possible ways of assigning each row in $R$ to a resolution vector $x \in S'$. Rows assigned to the same resolution vector $x \in S'$ are clustered together in the solution $\Pi_{S'}$.

The construction of the vertex set of the graph is based on a partition of $R$ into two disjoint sets called $R_{safe}$ and $R_{dist}$ (that is $R_{dist} = R \setminus R_{safe}$). The set $R_{safe}$ consists of those rows $r \in R$ belonging to a group $g$ such that: $s(g) \ge k$, that is $r$ belongs to a group of at least $k$ identical rows, and there exists a row $r_j \in S'$ such that $r_j$ and $r(g)$ are the same vector. Notice that only rows in $R_{safe}$ might have no suppressed entry in a solution $\Pi_{S'}$.

The vertex set of $G_{R,S'} = (V, E)$ consists of 6 sets. Two sets ($R^l_{dist}$, $R^r_{dist}$) consist of vertices associated with the rows in $R_{dist}$, three sets ($R^s_{safe}$, $R^l_{safe}$, $R^r_{safe}$) consist of vertices associated with the rows in $R_{safe}$, and a final set called $T$ consists of vertices associated with the rows in $S'$. In the latter case notice that for each row $x$ in $S'$ there exist $k$ vertices in $T$ to ensure that the cluster associated with $x$ has size at least $k$. The vertex set is defined as follows:

• for each row $x \in R_{dist}$, there is a corresponding vertex $R^l_{dist}(x)$ in $R^l_{dist}$ and a corresponding vertex $R^r_{dist}(x)$ in $R^r_{dist}$;

• for each group $g$ consisting of the set of rows $\{x_1, x_2, \dots, x_{s(g)}\}$, where each $x_i \in R_{safe}$, $1 \le i \le s(g)$, there are $k$ corresponding vertices in $R^s_{safe}$ (such vertices are denoted by $R^s_{safe}(g,1), \dots, R^s_{safe}(g,k)$), $exc(g)$ corresponding vertices in $R^l_{safe}$ (such vertices are denoted by $R^l_{safe}(g,1), \dots, R^l_{safe}(g,exc(g))$), and $exc(g)$ corresponding vertices in $R^r_{safe}$ (such vertices are denoted by $R^r_{safe}(g,1), \dots, R^r_{safe}(g,exc(g))$);

• for each row $x \in S'$, there are $k$ corresponding vertices in $T$ (such vertices are denoted by $T(x,1), \dots, T(x,k)$).

Notice that our graph $G_{R,S'}$ is edge-weighted. Let $w_h$ be the weight function assigning a positive weight to each edge of $G_{R,S'}$. Given a set of edges $E' \subseteq E$, we denote by $w_h(E') = \sum_{y \in E'} w_h(y)$ the total weight of $E'$.

Notice that the set $S'$ consists of two disjoint sets: the set $S'_{safe}$ consists of those rows in $S'$ that have no suppressions, while $S'_{cost} = S' \setminus S'_{safe}$. Each edge connects a vertex of $R^s_{safe} \cup R^l_{safe} \cup R^l_{dist}$ with a vertex of $R^r_{safe} \cup R^r_{dist} \cup T$, hence the graph $G_{R,S'}$ is bipartite. Intuitively, each edge represents a possible assignment of a row in $R$ to a resolution vector in $S'$.



Algorithm 2: From a matching to a feasible solution of $(|\Sigma|,m)$-AP.

Input: A graph $G_{R,S'}$ associated with an instance $R$ and a maximum weight matching $M$ of $G_{R,S'}$
Output: A solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP over instance $R$

foreach edge $y$ of $M$ do
    if $y = (R^l_{dist}(r), T(x,j))$ then /* edges defined at point 1 */
        row $r$ is assigned to a set whose resolution row is $x$, $x \in S'$;
    if $y = (R^l_{dist}(r), R^r_{dist}(r))$ then /* edges defined at point 2 */
        row $r$ is assigned to a set whose resolution row is $r_y = \arg\max_{r_x \in comp(r,S')} \{m - del(r_x)\}$, $r_y \in S'$;
    if $y = (R^s_{safe}(g,i), T(x,j))$ then /* edges defined at point 3 */
        assign the $i$-th row of $g$ to a set whose resolution row is $x$, $x \in S'$;
    if $y = (R^l_{safe}(g,i), T(x,j))$ then /* edges defined at point 4 */
        assign the $i$-th exceeding row of $g$ to a set whose resolution row is $x$, $x \in S'$;
    if $y = (R^l_{safe}(g,i), R^r_{safe}(g,i))$ then /* edges defined at point 5 */
        assign the $i$-th exceeding row of group $g$ to the set whose resolution row is $r(g)$, with $r(g) \in S'$ and $r \in R_{safe}$;



Now we are ready to define formally the set of edges $E$ of $G_{R,S'}$ and the weight function $w_h$. There are five possible kinds of edges.

1. Let $r$ be a row of $R_{dist}$, and let $x$ be a row in $comp(r, S') \cap S'_{cost}$. Then there is an edge $y = (R^l_{dist}(r), T(x,j))$, for each $1 \le j \le k$, with weight $w_h(y) = w'(x)$.

2. Let $r$ be a row in $R_{dist}$. Then there is an edge $y = (R^l_{dist}(r), R^r_{dist}(r))$ with weight $w_h(y) = w(r)$.

3. Let $g$ be a group consisting of rows $\{r_1, \dots, r_{s(g)}\}$, where $r_i$, for each $i$ with $1 \le i \le s(g)$, is a row of $R_{safe}$; let $r'$ be the resolution vector of $S'_{safe}$ identical to $r(g)$. Then there is an edge $y_i = (R^s_{safe}(g,i), T(r',i))$, for each $i$ with $1 \le i \le k$. All edges $y_i$ have weight $w_h(y_i) = w'(r')$.

4. Let $g$ be a group consisting of rows $\{r_1, \dots, r_{s(g)}\}$, where $r_i$, for each $i$ with $1 \le i \le s(g)$, is a row of $R_{safe}$; let $x$ be a row in $comp(r(g), S') \cap S'_{cost}$. Then there is an edge $y_{i,j} = (R^l_{safe}(g,i), T(x,j))$, for each $i$ with $1 \le i \le exc(g)$ and for each $j$ with $1 \le j \le k$. All edges $y_{i,j}$ have weight $w_h(y_{i,j}) = w'(x)$.

5. Let $g$ be a group consisting of rows $\{r_1, \dots, r_{s(g)}\}$, where $r_i$, $1 \le i \le s(g)$, is a row of $R_{safe}$. Then there is an edge $y_i = (R^l_{safe}(g,i), R^r_{safe}(g,i))$ for each $i$ with $1 \le i \le exc(g)$. All edges $y_i$ have weight $w_h(y_i) = w(r(g))$.
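The five kinds of edges can be generated mechanically. The following sketch returns a list of weighted edges; it is an illustration under several assumptions that are not part of the paper: rows are strings, vertices are tuples tagged with hypothetical set names (`"Rl_dist"`, `"Rr_dist"`, `"Rs_safe"`, `"Rl_safe"`, `"Rr_safe"`, `"T"`), and $S'$ is assumed to contain a compatible vector for every row. Computing a maximum weight matching on the result is left to any standard bipartite matching routine and is not shown.

```python
from collections import Counter

def n_del(x):
    return x.count("*")

def compatible(x, r):
    return all(a == b or a == "*" for a, b in zip(x, r))

def build_edges(rows, S_prime, k):
    """Weighted edges (left vertex, right vertex, weight) of the bipartite graph."""
    m = len(rows[0])
    comp = lambda r: [x for x in S_prime if compatible(x, r)]
    w = lambda r: max(m - n_del(x) for x in comp(r))
    W = sum(w(r) for r in rows)
    w_prime = lambda x: W + m - n_del(x) + 1
    size = Counter(rows)
    # Groups of R_safe: at least k identical rows and an identical vector in S'.
    safe = {g for g, s in size.items() if s >= k and g in S_prime}
    cost_vectors = lambda r: [x for x in comp(r) if n_del(x) > 0]  # comp ∩ S'_cost
    edges = []
    for idx, r in enumerate(rows):
        if r in safe:
            continue
        for x in cost_vectors(r):                                   # point 1
            for j in range(1, k + 1):
                edges.append((("Rl_dist", idx), ("T", x, j), w_prime(x)))
        edges.append((("Rl_dist", idx), ("Rr_dist", idx), w(r)))    # point 2
    for g in safe:
        for i in range(1, k + 1):                                   # point 3
            edges.append((("Rs_safe", g, i), ("T", g, i), w_prime(g)))
        for i in range(1, size[g] - k + 1):                         # exc(g) rows
            for x in cost_vectors(g):                               # point 4
                for j in range(1, k + 1):
                    edges.append((("Rl_safe", g, i), ("T", x, j), w_prime(x)))
            edges.append((("Rl_safe", g, i), ("Rr_safe", g, i), w(g)))  # point 5
    return edges

rows = ["aaa", "aaa", "aaa", "aaa", "aba", "bbb", "bbc"]
edges = build_edges(rows, ["aaa", "a*a", "bb*"], 2)
print(len(edges))  # 17
```

Every edge joins a vertex tagged `Rl_dist`, `Rs_safe` or `Rl_safe` to one tagged `T`, `Rr_dist` or `Rr_safe`, mirroring the bipartition stated in the text.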



4.2 Computing a solution of $(|\Sigma|,m)$-AP

In this section we prove in Lemma 6 that $\Pi_{S'}(M)$ is a clustering of the rows in $R$ that is a feasible solution for the $(|\Sigma|,m)$-AP problem. See Figure 1 for an example.



Since $G_{R,S'}$ is bipartite, we can efficiently compute a maximum weight matching $M$ of $G_{R,S'}$ [17]. Given a matching $M$ of the graph $G_{R,S'}$, Algorithm 2 computes in polynomial time a clustering $\Pi_{S'}(M)$ of the rows in $R$. Informally, the clustering is computed by assigning the rows in $R$ to the resolution vectors in $S'$, using the edges in the matching $M$.

Notice that each vertex $R^s_{safe}(r,i)$ has only the edge $(R^s_{safe}(r,i), T(r,i))$ on it, hence we can always add those edges to any matching.¹ Let $M$ be a matching of $G_{R,S'}$ and let $v$ be a vertex of $G_{R,S'}$; then we say that $v$ is covered by $M$ if there exists an edge of $M$ for which $v$ is one of its endpoints. Moreover, we will say that $M$ is feasible if all vertices in $T$ are covered by $M$. When a matching $M$ covers all vertices in $R^l_{dist} \cup R^l_{safe}$ and is feasible, it is defined as a complete matching. Let $\Pi$ be a clustering of an instance $R$ of the $(|\Sigma|,m)$-AP problem. Then $\Pi$ is feasible if and only if each set of the partition $\Pi$ contains at least $k$ rows. The next part of this section is devoted to showing that every maximum weight matching $M$ is complete and that the clustering $\Pi_{S'}(M)$ is feasible. First, we will show in the next two lemmata that, given $W' = k \sum_{r_x \in S'} w'(r_x)$, $W'$ is a threshold that distinguishes between matchings that are feasible and those that are not.

Lemma 4. Let $M$ be a matching of $G_{R,S'}$, let $X$ be the subset of $T$ consisting of the vertices of $T$ that are covered by $M$, and let $M_1$ be the subset of the edges of $M$ that have one endpoint in $X$. Then the total weight of the edges in $M_1$ is exactly $\sum_{T(t,i) \in X} w'(t)$.

Proof. It is an immediate consequence of the observation that all edges where an endpoint is $T(t,j)$ have the same weight $w'(t)$, with $t \in S'$. □



¹ Notice that these connected components are introduced only to simplify the relationship between a matching $M$ and the corresponding solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP.

Rows R

Name   Data   w   Group
r_1    aaa    3   g_1
r_2    aaa    3   g_1
r_3    aaa    3   g_1
r_4    aaa    3   g_1
r_5    aba    2   g_2
r_6    bbb    2   g_3
r_7    bbc    2   g_4

Resolution vectors S'

Name   Vectors   w
s_1    aaa       21
s_2    a*a       20
s_3    bb*       20

Figure 1: An instance $R$ of $(|\Sigma|,m)$-AP, with $k = 2$ and $m = 3$, a resolution vector set $S'$ and the associated graph $G_{R,S'}$. The thick edges are a maximum weight matching of $G_{R,S'}$. The corresponding solution is made of the sets $\{r_1, r_2, r_3\}$ (cost 0), $\{r_4, r_5\}$ (cost 2), $\{r_6, r_7\}$ (cost 2).

Lemma 5. Let $M$ be a matching of $G_{R,S'}$ and let $M_1$ be the subset of the edges of $M$ that have one endpoint in $T$. Then the total weight of the edges in $M_1$ is at least $W' = k \sum_{r \in S'} w'(r)$ if and only if $M$ is feasible.

Proof. Let $M_1$ be the subset of the edges of $M$ that have one endpoint in $T$, and let $W_1$ be the total weight of the edges in $M_1$. An immediate consequence of Lemma 4 is that $W_1 = W'$ if and only if $M$ is feasible. Assume now that $M$ is not feasible; then there exists at least one vertex $T(x,j) \in T$ that is not covered by $M$. Again, a consequence of Lemma 4 is that $W_1 \le W' - w'(x)$. Let $M_2$ be the set $M \setminus M_1$. By construction, $w'(x) > W$ and $W$ is an upper bound on the total weight of $M_2$, therefore $W_1 + w_h(M_2) < W'$, completing the proof. □

Using Lemmata 4 and 5 we can prove Lemma 6.

Lemma 6. Let $M$ be a maximum weight matching of $G_{R,S'}$; then $M$ is complete and the solution $\Pi_{S'}(M)$ computed by Algorithm 2 is feasible.

4.3 Proving the optimality of $\Pi_{S'}(M)$

This section is devoted to proving that, starting from a maximum weight matching $M$, Algorithm 2 computes an optimal solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP. In order to prove that any maximum weight matching $M$ of the graph $G_{R,S'}$ leads to an optimal solution of $(|\Sigma|,m)$-AP over instance $R$, we are going to prove that $\sum_{(u,v) \in M} w_h((u,v)) \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$ if and only if $(|\Sigma|,m)$-AP over instance $R$ admits a solution with cost not greater than $e$, and such a solution is computed by applying Algorithm 2. This result will be obtained through a sequence of technical lemmata.

Since $M$ is a maximum weight matching, we can assume by Lemma 6 that $M$ is complete. Given a complete matching $M$, we denote by $M(T)$ the set of edges of $M$ with one endpoint in $R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}$ and one endpoint in $T$, while we denote by $M(L)$ the set of those edges of $M$ that have one endpoint in $R^l_{dist} \cup R^l_{safe}$ and one endpoint in $R^r_{dist} \cup R^r_{safe}$. Furthermore, let us denote by $V(T)$ the set of vertices of $R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}$ that are endpoints of an edge in $M(T)$ and by $V(L)$ the set of vertices of $R^l_{dist} \cup R^l_{safe}$ that are endpoints of an edge in $M(L)$. Notice that, by definition of $V(L)$ and of complete matching, $V(T) \cup V(L) = R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}$. Finally, let us denote by $R(L)$ the set of rows in $R$ associated with the vertices in $V(L)$. Lemma 7 shows how the weight of a complete matching $M$ is related to the edge weights of $G_{R,S'}$.

Lemma 7. Let $M$ be a complete matching of $G_{R,S'}$, and let $w_h(M)$ be the total weight of $M$. Then $w_h(M) = k \sum_{r \in S'} (W + m - del(r) + 1) + \sum_{r \in R(L)} (m - del(r)) = (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - (k \sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r))$.

In the next two lemmata, we will show that: (i) given an instance $R$ of $(|\Sigma|,m)$-AP, if there exists a solution of $(|\Sigma|,m)$-AP over $R$ that suppresses at most $e$ entries then the graph $G_{R,S'}$ associated with $R$ admits a complete matching with total weight $w_h(M) \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$; (ii) given a complete matching of the graph $G_{R,S'}$ of total weight $w_h(M) \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$, Algorithm 2 returns a solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP that suppresses at most $e$ entries. These lemmata, coupled with Lemma 6, prove the correctness of Algorithm 2 in Theorem 10.



Lemma 8. Let $R$ be an instance of $(|\Sigma|,m)$-AP, let $\Pi_{S'}$ be a feasible solution of $(|\Sigma|,m)$-AP over instance $R$ that suppresses at most $e$ entries, let $G_{R,S'}$ be the graph associated with $R$ and $S'$. Then there exists a complete matching of $G_{R,S'}$ with total weight $w_h(M) \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$.

Lemma 9. Let $R$ be an instance of $(|\Sigma|,m)$-AP, let $G_{R,S'}$ be the graph associated with $R$, and let $M$ be a complete matching of $G_{R,S'}$ of weight $w_h(M) \ge (W + 1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R^s_{safe}| - e$. Then, starting from the matching $M$ of $G_{R,S'}$, Algorithm 2 computes a feasible solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP over instance $R$, where there are at most $e$ suppressions.

Proof. Since M is complete, for each vertex T(x,j) of T, with 1 < j < k, there exists an edge 
(v,T(x,j)) G M for some v E (R l dist UR l safe UR'^ afe ). Then Algorithm [2] defines a solution IL S >(M) 
for (|E|,m)-AP assigning, for each edge (v,T(x,j)), the row r corresponding to vertex v to the 
set that has resolution vector x. More precisely, row r is defined by Algorithm [2] as the j-th 
element of the set that has resolution vector x. Therefore each set associated with a resolution 
row x E S' will consist of at least k rows compatible with x. Hence Hs'(M) is a feasible solution. 

Recall that $M$ has a total weight of at least
$(W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - e$. We will prove that $\Pi_{S'}(M)$
induces at most $e$ suppressions. By Lemma 7,

$$w_h(M) = k\sum_{r \in S'} (W + m - del(r) + 1) + \sum_{r \in R(L)} (m - del(r))$$
$$= (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - \Big(k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r)\Big)$$
$$\ge (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - e,$$

which yields $k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r) \le e$.
Notice that, by definition of $\Pi_{S'}(M)$, each vertex of $V(T)$ corresponds to a row in $R$ assigned
to a set with a resolution vector in $S'$. The rows associated with $V(T)$ induce a cost in $\Pi_{S'}(M)$
of $k\sum_{r \in S'} del(r)$. Furthermore, the vertices of $V(L)$ correspond to rows of $R$ inducing a cost of
at most $\sum_{r \in R(L)} del(r)$. Therefore $\Pi_{S'}(M)$ induces at most
$k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r) \le e$ suppressions. □

Theorem 10. Let $R$ be an instance of $(|\Sigma|,m)$-AP. Then Algorithm 2 returns a solution $\Pi_{S'}(M)$
of cost at most $e$ if and only if such a solution exists.

Proof. By Lemma 6, $\Pi_{S'}(M)$ is feasible. Hence if $\Pi_{S'}(M)$ suppresses at most $e$ entries, then
$(|\Sigma|,m)$-AP admits a solution of cost at most $e$. On the other hand, by Lemma 8, if there exists
a solution $\Pi'$ of $R$ that suppresses at most $e$ entries, then there exists a feasible matching $M$ with
weight $w_h(M) \ge (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - e$. Then, by Lemma 9,
Algorithm 2 returns a solution $\Pi_{S'}(M)$ of $(|\Sigma|,m)$-AP that suppresses at most $e$ entries. □

If $(|\Sigma|,m)$-AP admits a solution that suppresses at most $e$ entries, then there exists a set
$S^*$ of resolution vectors such that $\Pi_{S^*}$ is a solution for $(|\Sigma|,m)$-AP with resolution
vectors $S^*$ that suppresses at most $e$ entries. Now, there exist $O(2^{(|\Sigma|+1)^m})$ possible
sets of resolution vectors, and the construction of graph $G_{R,S'}$ requires
$O(k|S'||R|) \le O(ke|R|) \le O(kmn^2)$. A maximum matching $M$ of a bipartite graph can be computed in
polynomial time [17] and, starting from $M$, we can compute a solution of $(|\Sigma|,m)$-AP in time
$O(|M|) \le O(m)$. Hence the overall time complexity of the algorithm is $O(2^{(|\Sigma|+1)^m} kmn^2)$.
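
To get a feeling for the exponential factor, the following minimal sketch enumerates candidate resolution vectors. It assumes, as the bound $2^{(|\Sigma|+1)^m}$ suggests, that a resolution vector is a length-$m$ vector over $\Sigma$ extended with one suppression symbol; the function names and the `"*"` symbol are illustrative and do not come from the paper.

```python
from itertools import product

def resolution_vectors(alphabet, m, star="*"):
    """All length-m vectors over the alphabet plus a suppression symbol (assumed encoding)."""
    symbols = list(alphabet) + [star]
    return list(product(symbols, repeat=m))

def count_candidate_sets(alphabet, m):
    """Number of subsets of resolution vectors: 2 ** ((|Sigma| + 1) ** m)."""
    return 2 ** ((len(alphabet) + 1) ** m)

vectors = resolution_vectors(["0", "1"], 2)
print(len(vectors))                          # 9
print(count_candidate_sets(["0", "1"], 2))   # 512
```

Even for a binary alphabet and two columns there are 512 candidate sets, which is why the algorithm is fixed-parameter only in $|\Sigma|$ and $m$ together.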

5 APX-hardness of 3-AP(3) 

In this section we investigate the computational and approximation complexity of 3-AP(3), that
is $k$-AP when each row consists of exactly 3 columns and $k = 3$. We show that 3-AP(3) is APX-hard
via an L-reduction from Minimum Vertex Cover on Cubic Graphs (MVCC), which is known
to be APX-hard. Due to the page limit, we only sketch the proof. The MVCC problem, given a
cubic graph $G = (V,E)$, asks for a smallest $C \subseteq V$ such that each edge of $G$ has at least one of
its endpoints in $C$.
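
As a small illustration of the source problem, the sketch below checks the vertex cover condition and finds a minimum cover by brute force on $K_4$, the smallest cubic graph. The helper names are illustrative, and the exhaustive search is exponential, so it only serves as a reference on tiny instances.

```python
from itertools import combinations

def is_vertex_cover(edges, cover):
    """True if every edge has at least one endpoint in cover."""
    return all(u in cover or v in cover for u, v in edges)

def min_vertex_cover(vertices, edges):
    """Smallest vertex cover by exhaustive search (exponential time)."""
    for size in range(len(vertices) + 1):
        for candidate in combinations(vertices, size):
            if is_vertex_cover(edges, set(candidate)):
                return set(candidate)

# K4: every vertex has degree 3, so the graph is cubic.
k4_vertices = [0, 1, 2, 3]
k4_edges = list(combinations(k4_vertices, 2))
cover = min_vertex_cover(k4_vertices, k4_edges)
print(len(cover))  # 3
```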

Let $G = (V,E)$ be an instance of MVCC, where $|V| = n$ and $|E| = m$. The reduction builds
an instance $R$ of 3-AP(3) associating with each vertex $v_i \in V$ a set $R_i$ consisting of 9 rows, and
with each edge $e = (v_i, v_j) \in E$ a set $E_{i,j}$ consisting of 7 rows. Finally, a set $X$ of 3 more rows is
added to $R$.

Now we can describe formally our reduction. Let $R_i$ be the set of rows associated with vertex
$v_i \in V$. The rows in $R_i$ have values over an alphabet $\Sigma_i = \{\sigma_i, \sigma_{i,1}, \sigma_{i,2}, \sigma_{i,3}\}$.
The set $R_i$ consists of 9 rows belonging to 6 groups of identical rows, denoted by $g_1(v_i), \ldots, g_6(v_i)$.
The representative rows of groups $g_1(v_i), \ldots, g_6(v_i)$, and the cardinality of the groups, are defined as
follows:

- $r(g_h(v_i)) = \sigma_{i,h}\,\sigma_i\,\sigma_{i,h}$, with $h \in \{1,2,3\}$; each group $g_h(v_i)$, with $h \in \{1,2,3\}$,
consists of exactly two rows;

- $r(g_{3+h}(v_i)) = \sigma_i\,\sigma_i\,\sigma_{i,h}$, with $h \in \{1,2,3\}$; each group $g_{3+h}(v_i)$, with
$h \in \{1,2,3\}$, consists of exactly one row.

Notice that given two rows $r$, $r'$ belonging to different groups of $R_i$, $H(r,r') = 1$ iff $r \in g_h(v_i)$,
$r' \in g_{3+h}(v_i)$ (or the converse) or $r, r' \in g_4(v_i) \cup g_5(v_i) \cup g_6(v_i)$. Given a group
$g_h(v_i)$, with $h \in \{1,2,3\}$, the symbol $\sigma_{i,h}$ is called the private symbol of $g_h(v_i)$. The
groups of rows $g_j(v_i)$, with $j \in \{1,2,3\}$, are called the docking groups of $R_i$, and each of them is
associated with a set $E_{i,h}$ of rows encoding an edge $(v_i,v_h)$ of $G$. More precisely, given the set of
rows $E_{i,j}$, we denote by $d_{i,j}(g(v_i))$ the docking group of $R_i$ associated with set $E_{i,j}$.
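
The distance structure of a vertex gadget can be checked mechanically. The sketch below is only an illustration under stated assumptions: symbols are encoded as tuples (`("vert", i)` for $\sigma_i$, `("priv", i, h)` for $\sigma_{i,h}$), and the representative row of $g_{3+h}(v_i)$ is taken to be $\sigma_i\,\sigma_i\,\sigma_{i,h}$, which is our reading of the construction consistent with the distance-1 property stated above.

```python
def hamming(r1, r2):
    """Number of positions where two rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

def vertex_groups(i):
    """Map group index -> (representative row, number of copies) for R_i.

    Symbols are tuples (illustrative encoding): ("vert", i) stands for
    sigma_i and ("priv", i, h) for the private symbol sigma_{i,h}.
    """
    sigma = ("vert", i)
    groups = {}
    for h in (1, 2, 3):
        private = ("priv", i, h)
        groups[h] = ((private, sigma, private), 2)    # docking group g_h
        groups[3 + h] = ((sigma, sigma, private), 1)  # assumed reading of g_{3+h}
    return groups

groups = vertex_groups(1)
total_rows = sum(copies for _, copies in groups.values())
distance_one = sorted(
    (a, b)
    for a in groups
    for b in groups
    if a < b and hamming(groups[a][0], groups[b][0]) == 1
)
print(total_rows)    # 9
print(distance_one)  # [(1, 4), (2, 5), (3, 6), (4, 5), (4, 6), (5, 6)]
```

The pairs at distance 1 are exactly $(g_h, g_{3+h})$ and the pairs inside $\{g_4, g_5, g_6\}$; every other pair of groups is at distance 2.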

Now, let us build the set $E_{i,j}$ of rows associated with an edge $(v_i,v_j)$. Let $d_{i,j}(g(v_i))$ and
$d_{i,j}(g(v_j))$ be the two docking groups of $R_i$ and $R_j$ respectively, associated with the set $E_{i,j}$.
Let $\sigma_{i,x}$ and $\sigma_{j,y}$ be the private symbols of groups $d_{i,j}(g(v_i))$ and $d_{i,j}(g(v_j))$
respectively. The set $E_{i,j}$ consists of 7 rows distributed in 6 groups. The rows of $E_{i,j}$ have values
over the alphabet
$\Sigma_{i,j} = \{\sigma_{i,x}, \sigma_{j,y}, \sigma_{i,j}, \sigma_{i,j,4}, \sigma_{i,j,5}, \sigma_{i,j,6}\}$.
Let us define the representative rows and the cardinality of the groups in $E_{i,j}$:

- $r(g_1(v_i,v_j)) = \sigma_{i,x}\,\sigma_{i,j}\,\sigma_{i,x}$; group $g_1(v_i,v_j)$ consists of a single row;

- $r(g_2(v_i,v_j)) = \sigma_{i,x}\,\sigma_{i,j}\,\sigma_{j,y}$; group $g_2(v_i,v_j)$ consists of two rows;

- $r(g_3(v_i,v_j)) = \sigma_{j,y}\,\sigma_{i,j}\,\sigma_{j,y}$; group $g_3(v_i,v_j)$ consists of a single row;

- $r(g_t(v_i,v_j)) = \sigma_{i,j,t}\,\sigma_{i,j}\,\sigma_{i,j,t}$, with $t \in \{4,5,6\}$; each group $g_t(v_i,v_j)$,
with $t \in \{4,5,6\}$, consists of a single row.

The group of $E_{i,j}$ that has two occurrences of the symbol $\sigma_{i,x}$ shared with $d_{i,j}(g(v_i))$ is
called the $i$-group of set $E_{i,j}$, and is denoted by $g^i(v_i,v_j)$. Notice that, given two rows $r$, $r'$ of
$R_i$, $E_{i,j}$ respectively, $H(r,r') = 1$ iff $r \in d_{i,j}(g(v_i))$ and $r' \in g^i(v_i,v_j)$.
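
The docking property can be verified on a concrete edge gadget. The following sketch assumes the same illustrative tuple encoding of symbols as a stand-in for $\sigma_{i,x}$, $\sigma_{j,y}$, $\sigma_{i,j}$ and $\sigma_{i,j,t}$; the function names and the choice of docking groups ($x = 1$ for $v_1$, $y = 3$ for $v_2$) are hypothetical.

```python
def hamming(r1, r2):
    """Number of positions where two rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

def docking_row(i, h):
    """Representative row of docking group g_h(v_i): sigma_{i,h} sigma_i sigma_{i,h}."""
    private = ("priv", i, h)
    return (private, ("vert", i), private)

def edge_groups(i, x, j, y):
    """Groups of E_{i,j} when the edge docks at g_x(v_i) and g_y(v_j).

    Returns a map group index -> (representative row, number of copies).
    """
    a, b, mid = ("priv", i, x), ("priv", j, y), ("edge", i, j)
    groups = {
        1: ((a, mid, a), 1),  # the i-group
        2: ((a, mid, b), 2),
        3: ((b, mid, b), 1),  # the j-group
    }
    for t in (4, 5, 6):
        extra = ("edge", i, j, t)
        groups[t] = ((extra, mid, extra), 1)
    return groups

groups = edge_groups(1, 1, 2, 3)
total_rows = sum(copies for _, copies in groups.values())
dist_to_dock_i = {g: hamming(docking_row(1, 1), row) for g, (row, _) in groups.items()}
dist_to_dock_j = {g: hamming(docking_row(2, 3), row) for g, (row, _) in groups.items()}
print(total_rows)      # 7
print(dist_to_dock_i)  # {1: 1, 2: 2, 3: 3, 4: 3, 5: 3, 6: 3}
print(dist_to_dock_j)  # {1: 3, 2: 2, 3: 1, 4: 3, 5: 3, 6: 3}
```

Only the $i$-group is at distance 1 from the docking group of $R_i$, and symmetrically for $R_j$.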

Finally, a set $X$ of 3 rows $x_1, x_2, x_3$ is added to $R$. The rows in $X$ have values over an
alphabet $\Sigma_x$ disjoint from any other alphabet $\Sigma_i$, $\Sigma_{i,j}$. Each row is $x_i = w_i^3$,
and it has Hamming distance 3 from any other row of $R$. Therefore for any set $C$ containing some row
$x_i$, all positions of every row in $C$ will be suppressed.

Now, consider the set Ri. The following lemma gives a lower bound on the cost of an optimal 
solution of 3-AP(3) over instance Ri. 

Lemma 11. Let Ri be a set of rows, then an optimal solution of 3-AP(3) over instance Ri has 
a cost of at least 9. 

The main idea of the reduction is showing that we can restrict our attention to a set of solutions,
called canonical solutions, that is solutions $\Pi$ where:

(i) $\Pi$ contains exactly one cluster $X$ containing only suppressed entries;

(ii) each set $R_i$ is associated with either a type a or a type b solution (to be defined later),
possibly with the contribution of some rows in the sets $E_{i,j}$ for a type b solution;

(iii) two sets $R_i$, $R_j$ are both associated with a type b solution only if there is no edge set
$E_{i,j}$ in the instance $R$, that is the corresponding vertices $v_i$, $v_j$ are not adjacent in $G$;

(iv) either an edge set is part of a type b solution of some set $R_i$ and has a total cost of 10, or
it has a total cost of 11.

Notice that, by construction, in a canonical solution, rows $x_1, x_2, x_3 \in X$.

Let us define the notions of type a and type b solution. Given a set $R_i$ and the edge sets
$E_{i,j}$, $E_{i,h}$, $E_{i,l}$, a type a solution for $R_i$ consists of three sets $S_{i,1}$, $S_{i,2}$, $S_{i,3}$,
where $S_{i,t} = g_t(v_i) \cup g_{t+3}(v_i)$, while a type b solution consists of the following sets:
(i) three sets $d_{i,j}(g(v_i)) \cup g^i(v_i,v_j)$, $d_{i,h}(g(v_i)) \cup g^i(v_i,v_h)$,
$d_{i,l}(g(v_i)) \cup g^i(v_i,v_l)$; (ii) $g_4(v_i) \cup g_5(v_i) \cup g_6(v_i)$.

Lemma 12 is the main technical contribution of this section.
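
The costs of the two solution types can be recomputed directly from their definition. The sketch below is illustrative only: rows are built with a hypothetical tuple encoding of the symbols, the representative row of $g_{3+h}(v_i)$ is assumed to be $\sigma_i\,\sigma_i\,\sigma_{i,h}$, and the cost of a cluster is the number of non-constant columns times the number of its rows.

```python
def suppressed_columns(rows):
    """Number of columns in which the rows of a cluster do not all agree."""
    return sum(len({r[c] for r in rows}) > 1 for c in range(3))

SIGMA = ("vert",)  # stands for sigma_i

def private(h):
    return ("priv", h)  # stands for sigma_{i,h}

def dock(h):
    return (private(h), SIGMA, private(h))        # row of docking group g_h

def single(h):
    return (SIGMA, SIGMA, private(h))             # assumed row of g_{3+h}

def i_group(h):
    return (private(h), ("edge", h), private(h))  # i-group of the edge docked at g_h

# Type a: S_{i,t} = g_t U g_{t+3}; all three rows of each cluster belong to R_i.
type_a = [[dock(h), dock(h), single(h)] for h in (1, 2, 3)]
cost_a = sum(3 * suppressed_columns(c) for c in type_a)

# Type b: each docking group with its i-group, plus g_4 U g_5 U g_6.
type_b_dock = [[dock(h), dock(h), i_group(h)] for h in (1, 2, 3)]
type_b_rest = [single(1), single(2), single(3)]
cost_b_in_Ri = sum(2 * suppressed_columns(c) for c in type_b_dock) \
    + 3 * suppressed_columns(type_b_rest)
cost_b_in_edges = sum(1 * suppressed_columns(c) for c in type_b_dock)

print(cost_a, cost_b_in_Ri, cost_b_in_edges)  # 9 9 3
```

Both types suppress 9 entries in the rows of $R_i$; a type b solution additionally charges one suppression to each of the three $i$-groups it borrows from the edge sets.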

Lemma 12. Let $\Pi$ be a solution of 3-AP(3) over instance $R$. Then we can compute in polynomial
time a canonical solution $\Pi'$ of 3-AP(3) over instance $R$ such that $c(\Pi') \le c(\Pi)$.

Sketch of the proof. By direct inspection, type a and type b solutions induce 9 suppressions in the
rows of $R_i$; hence, by Lemma 11, they are optimal for $R_i$. The next step is
computing in polynomial time a solution $\Pi''$ such that each set $R_i$ is associated in $\Pi''$ only with
either a type a or a type b solution, and such that $c(\Pi'') \le c(\Pi)$. This step is obtained by exploiting
the optimality of type a and type b solutions for $R_i$, and some properties of the instance $R$.
Then, starting from such a solution $\Pi''$, we can compute in polynomial time a canonical solution
$\Pi'$ such that $c(\Pi') \le c(\Pi'')$. The main idea to prove this result is that for any two sets $R_i$, $R_j$
such that both $R_i$ and $R_j$ are associated with a type b solution in $\Pi''$ and $E_{i,j}$ is part of the
instance $R$, we can improve the solution by imposing a type a solution for $R_i$. □



A consequence of Lemmata 11 and 12, and of some properties of the instance $R$, is Lemma 13.

Lemma 13. Let $\Pi$ be a solution of 3-AP(3) over instance $R$ of cost $6|V| + 3|C| + 11|E| + 9$; then
we can compute in polynomial time a solution of MVCC over instance $G$ of size $|C|$.

Proof. Let us consider a canonical solution of 3-AP(3). First, notice that the three rows $x_1$, $x_2$,
$x_3$ of $X$ provide together a cost of 9. Since two sets $R_i$, $R_j$ are both associated with a type b
solution only if there is no edge set $E_{i,j}$, for each edge set $E_{i,j}$ at least one of the sets $R_i$
and $R_j$ is associated with a type a solution. Consequently, the sets of rows associated with a type
a solution correspond to a vertex cover of the graph $G$.

Now consider the cost of a canonical solution. For each set $R_i$ of rows associated with a type
b solution, we can show that each of the three edge sets $E_{i,j}$, $E_{i,h}$, $E_{i,l}$ has a cost of 10. Notice
that, given an edge set $E_{i,j}$, if both sets $R_i$, $R_j$ are associated with type a solutions, then we can
show that the edge set $E_{i,j}$ has a cost of 11. Charging this decrease in the cost of the edge
sets to the set $R_i$ of rows with a type b solution is equivalent to assigning a type b solution a cost
equal to 6, while a type a solution has a cost equal to 9. □
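
The accounting in this proof can be replayed numerically. The sketch below is a hedged illustration: it assumes a cubic graph and a valid vertex cover `cover`, gives every vertex gadget cost 9, every edge set cost 10 when one endpoint has a type b solution and 11 otherwise, and adds 9 for the set $X$. The function names are hypothetical.

```python
from itertools import combinations

def closed_form_cost(num_vertices, num_edges, cover_size):
    """The cost 6|V| + 3|C| + 11|E| + 9 stated in Lemma 13."""
    return 6 * num_vertices + 3 * cover_size + 11 * num_edges + 9

def accounted_cost(vertices, edges, cover):
    """Cost of a canonical solution: type a on the cover, type b elsewhere."""
    assert all(u in cover or v in cover for u, v in edges), "cover must be a vertex cover"
    total = 9                   # the three fully suppressed rows of X
    total += 9 * len(vertices)  # type a and type b both cost 9 on R_i
    for u, v in edges:
        # an edge set costs 10 if one endpoint is type b, 11 if both are type a
        total += 11 if (u in cover and v in cover) else 10
    return total

vertices = [0, 1, 2, 3]                   # K4, a cubic graph
edges = list(combinations(vertices, 2))
cover = {0, 1, 2}
print(accounted_cost(vertices, edges, cover))                  # 108
print(closed_form_cost(len(vertices), len(edges), len(cover)))  # 108
```

The two values agree because, in a cubic graph, each vertex outside the cover has exactly three incident edge sets of cost 10, which turns its cost of 9 into an effective cost of 6.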



Similarly to Lemma 13, we can prove that, starting from a solution $C$ of MVCC over instance
$G$, we can compute in polynomial time a solution $\Pi$ of 3-AP(3) over instance $R$ of cost
$6|V| + 3|C| + 11|E| + 9$. Therefore 3-AP(3) is APX-hard.

Appendix



Proofs of Section 3
Proof of Lemma 1

Lemma 14. Let $R$ be the instance of $(e)$-AP associated with $G$ and consider two rows $r$, $r_x(i,j)$ of $R$,
such that $r \in R_0$ and $r_x(i,j) \in R(i,j)$. Then $r[t] \ne r_x(i,j)[t]$, for each $1 \le t \le 2h$.

Proof. By construction, r[t) = for all t with 1 < t < 2h, while r x (i,j)[t] — a^. □ □ 

Proof of Lemma 2

Lemma 15. Let $G = (V,E)$ be a graph, let $V'$ be an $h$-clique of $G$ and let $R$ be the instance of $(e)$-AP
associated with $G$. Then we can compute in polynomial time a solution $\Pi$ of $(e)$-AP over instance $R$ with
cost at most $6h^3$.

Proof. Initially let $\Pi'$ be a solution consisting of clusters $R_0$, $R(i,j)$, for each $R(i,j) \in R$. For each
$R(i,j)$, let $r_1(i,j)$ be the first row of $R(i,j)$. Compute a new solution $\Pi$ consisting of clusters $R'_0$,
$R'(i,j)$, for each $R(i,j) \in R$, where:

• $R'(i,j) = R(i,j) \setminus \{r_1(i,j)\}$, for each $v_i, v_j \in V'$;

• $R'(i,j) = R(i,j)$, for $v_i \notin V'$ or $v_j \notin V'$;

• $R'_0 = R_0 \cup \bigcup_{R(i,j) \in R} (R(i,j) \setminus R'(i,j))$.

Notice that, since $V'$ is an $h$-clique, $|R'_0| = k$. Moreover, by construction,
$|R(i,j)| \ge |R'(i,j)| \ge |R(i,j)| - 1$, therefore $\Pi$ is a feasible solution for $R$. Notice also that no
entry is suppressed in the rows of each set $R'(i,j)$, therefore to determine the cost of $\Pi$ it suffices to
determine the number of entries deleted in $R'_0$, and we will show that such number is exactly $6h^3$.

Indeed, by construction, for each column $t$ of the first $2h$ columns, and for each row $r \in R_0$ and
$r_x(i,j) \in R(i,j)$, $r[t] \ne r_x(i,j)[t]$, hence all the entries of the first $2h$ columns of the rows in $R'_0$
must be deleted, resulting in $2hk$ suppressions. Now let us consider the columns with index
$2h+1 \le t \le 2h+n$ and $v_t \in V'$. In such positions, all rows of $R_0$ are equal to 0, while all rows in the
sets $R(y,t)$, $R(t,y)$ are equal to 1. Consider the $h(h-1)/2$ rows of $R'_0 \setminus R_0$. As the corresponding
edges are incident on a set of $h$ vertices, by construction there exists a set $H$ of exactly $h$ columns, with
$H \subseteq \{t : 2h+1 \le t \le 2h+n\}$, where at least one of the rows in $R'_0 \setminus R_0$ is equal to 1, while the
rows in $R_0$ are all equal to 0. Since in any other column all rows in $R'_0$ have value equal to 0, there
are additional $hk$ suppressions for the columns with index $2h+1 \le t \le 2h+n$. Overall, the number of
suppressions is $3hk$ which, by the choice of $k$, is equal to $6h^3$. □

Proofs of Section 4
Proof of Lemma 6

Lemma 16. Let $M$ be a maximum weight matching of $G_{R,S'}$; then the solution $\Pi_{S'}(M)$ computed by
Algorithm 2 is feasible.

Lemma 6 is a consequence of Lemmata 19 and 17.

Lemma 17. Let $M$ be a maximum weight matching of $G_{R,S'}$; then $M$ is a feasible matching.

Proof. First notice that, as $M$ is feasible, each vertex of $T$ is covered and each vertex $R^l_{safe}(r,j)$, with
$1 \le j \le k$, is covered by $M$. Assume that $M$ is not complete and that a vertex $R^l_{dist}(r)$ of $R^l_{dist}$
(resp. $R^l_{safe}(r,k+j)$, with $1 \le j \le exc(g)$, of $R^l_{safe}$) is not matched. Then, by construction, also the
vertex $R^r_{dist}(r)$ of $R^r_{dist}$ (resp. $R^r_{safe}(r,j)$ of $R^r_{safe}$) is not covered by $M$, as $R^l_{dist}(r)$
(resp. $R^l_{safe}(r,k+j)$) is the only vertex adjacent to $R^r_{dist}(r)$ (resp. $R^r_{safe}(r,j)$) in $G_{R,S'}$.
Hence we can compute the matching $M'$ by adding all the edges of $M$ to $M'$ and by adding the edges
$(R^l_{dist}(r), R^r_{dist}(r))$ (resp. $(R^l_{safe}(r,k+j), R^r_{safe}(r,j))$) for each vertex $R^l_{dist}(r)$
(resp. $R^l_{safe}(r,k+j)$) not covered by $M$. □

Lemma 18. Let $M$ be a feasible matching of $G_{R,S'}$; if $M$ is not complete, then we can compute in
polynomial time a complete matching $M'$ such that $w_h(M') > w_h(M)$.

As a consequence of Lemma 18 we assume in what follows that any matching $M$ is complete.
Furthermore, we can prove the following result.

Lemma 19. Let $M$ be a complete matching of $G_{R,S'}$. Then Algorithm 2 computes in polynomial time a
feasible clustering $\Pi_{S'}(M)$.

Proof. Since $M$ is feasible, all vertices in $T$ are covered by $M$. Furthermore, we can assume, by
Lemma 18, that each vertex in $R^l_{dist} \cup R^l_{safe}$ is covered by $M$. Hence each row in $R$ is assigned by
Algorithm 2 to a set whose resolution vector is in $S'$. Furthermore Algorithm 2 assigns to each set with
resolution vector $x \in S'$ at least $k$ rows. Hence the clustering $\Pi_{S'}(M)$ computed by Algorithm 2 is
feasible. □



Proof of Lemma 7

Lemma 20. Let $M$ be a complete matching of $G_{R,S'}$; then the total weight of $M$, $w_h(M)$, is equal to

$$k\sum_{r \in S'} (W + m - del(r) + 1) + \sum_{r \in R(L)} (m - del(r)) = (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe}| - \Big(k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r)\Big).$$

Proof. The total weight $w_h(M)$ of the matching $M$ is defined as

$$w_h(M) = \sum_{(u,v) \in M(T)} w_h((u,v)) + \sum_{(u,v) \in M(L)} w_h((u,v)).$$

By Lemma 5 and by definition of the weight function $w_h$, it follows that

$$w_h(M) = k\sum_{r \in S'} w'(r) + \sum_{r \in R(L)} (m - del(r))$$

and by definition of $w'(r)$ it holds

$$w_h(M) = k\sum_{r \in S'} (W + m - del(r) + 1) + \sum_{r \in R(L)} (m - del(r)).$$

Hence

$$w_h(M) = (W + m + 1)k|S'| - k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} m - \sum_{r \in R(L)} del(r).$$

By definition of feasible matching and by Lemma 18, $|V(T)| = |T|$. Furthermore, since $|T| = k|S'|$, then
$mk|S'| = m|T| = m|V(T)|$. By construction $\sum_{r \in R(L)} m = m|V(L)|$ and
$V(T) \cup V(L) = R^l_{dist} \cup R^l_{safe}$. Hence

$$w_h(M) = (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe}| - \Big(k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r)\Big).$$
□

Proof of Lemma 8

Lemma 21. Let $R$ be an instance of $(|\Sigma|,m)$-AP, let $\Pi_{S'}$ be a feasible solution of $(|\Sigma|,m)$-AP
over instance $R$ that suppresses at most $e$ entries, and let $G_{R,S'}$ be the graph associated with $R$ and $S'$.
Then there exists a complete matching $M$ of $G_{R,S'}$ with total weight
$w_h(M) \ge (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - e$.

Proof. Since $\Pi_{S'}$ is feasible, each set of $\Pi_{S'}$ associated with a resolution vector $r \in S'$ must
have cardinality at least $k$. Furthermore, we assume that the sets of $\Pi_{S'}$ are all associated with different
resolution vectors, otherwise we can merge all the sets with the same resolution vector without increasing
the cost of $\Pi_{S'}$.

Let $x$ be a row of $S'$ and denote by $R_x \subseteq R$ the set of rows of $R$ assigned to the set associated with
resolution vector $x$. Starting from $\Pi_{S'}$ we compute incrementally a matching $M$ by adding edges. First,
for each set of vertices $T(x,i)$, $1 \le i \le k$, let $i^*$ be the minimum number such that $T(x,i^*)$ does not have
any edge incident on it in $M$. First, assume that $x \in S'_{safe}$; add the edge $(R'^l_{safe}(g,i), T(x,i))$ to $M$,
for each $1 \le i \le k$. Now, assume that $x \in S'_{cost}$. Scan the rows in $R_x$ and, for each row $r$ in $R_x$, if
$r \in R_{dist}$ add the edge $(R^l_{dist}(r), T(x,i^*))$ to $M$. If $r \in R_{safe}$ and belongs to group $g$, add the edge
$(R^l_{safe}(g,i), T(x,i^*))$ to $M$. If no such $T(x,i^*)$ exists, then no edge is added to $M$. Notice that by
construction, since all sets in $\Pi_{S'}$ have at least $k$ rows, all vertices of $T$ are covered by $M$, therefore
$M$ is feasible.

Finally add to $M$ all edges $(R^l_{dist}(x), R^r_{dist}(x))$, $(R^l_{safe}(g,i), R^r_{safe}(g,i))$, $1 \le i \le exc(g)$, for
each vertex in $R^l_{dist}$, $R^l_{safe}$ respectively that is not already covered in $M$. Hence $M$ is complete.

Given a solution $\Pi_{S'}$, a resolution vector $x$ of $S'$ and the corresponding matching $M$, consider the
order in which the rows of a set $R_x$ are scanned sequentially to construct $M$. Each of the first $k$ rows
assigned to a cluster with resolution vector equal to $x$ corresponds, by construction, to an edge of $M$
joining a vertex of $V(T)$ and a vertex of $T$. Since $M$ is complete, those rows have a total cost in $\Pi_{S'}$ of
$k\sum_{r \in S'} del(r)$. The remaining rows of $R$ correspond to vertices of $V(L)$. Notice that those rows have
a total cost in $\Pi_{S'}$ not smaller than $\sum_{r \in R(L)} del(r)$. By Lemma 7,
$w_h(M) = (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - (k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r))$.
Since $\Pi_{S'}$ suppresses at most $e$ entries of $R$, then
$k\sum_{r \in S'} del(r) + \sum_{r \in R(L)} del(r) \le e$, therefore
$w_h(M) \ge (W+1)k|S'| + m|R^l_{dist} \cup R^l_{safe} \cup R'^l_{safe}| - e$. □

Proofs of Section 5

It is easy to see that, by construction, the following properties hold.

Proposition 22. Let $r_a$, $r_b$ be two rows of $R_i$, with $r_a = g_j(v_i)$ and $r_b = g_l(v_i)$, $j < l$. Let $r_c$ be a
row of $R_j$, with $i \ne j$. Then:

• $H(r_a,r_c) = H(r_b,r_c) = 3$;

• $H(r_a,r_b) \le 2$;

• $H(r_a,r_b) = 1$ iff $r_a = g_h(v_i)$ and $r_b = g_{h+3}(v_i)$, with $1 \le h \le 3$, or $r_a = g_j(v_i)$ and
$r_b = g_l(v_i)$, with $4 \le j < l \le 6$.

Proposition 23. Let $r_a$, $r_b$ be two rows of $E_{i,j}$, with $r_a \in g_h(v_i,v_j)$ and $r_b \in g_l(v_i,v_j)$, with $h < l$.
Let $r_c$, $r_d$ be two rows of $R_i$ and $R_p$, with $p \ne i,j$, and let $r_e$ be a row of $E_{t,z}$, with $t \ne i$ or
$z \ne j$. Then:

• $H(r_a,r_b) \le 2$;

• $H(r_a,r_b) = 1$ iff $r_a = g_h(v_i,v_j)$ and $r_b = g_{h+1}(v_i,v_j)$, with $1 \le h \le 2$;

• $H(r_a,r_c) = 1$ iff $r_c$ is in the docking group $d_{i,j}(g(v_i))$ of $R_i$ and $r_a$ is in the group $g^i(v_i,v_j)$;

• $H(r_a,r_c) = 2$ only if $r_c$ is in a group adjacent to $d_{i,j}(g(v_i))$;

• $H(r_a,r_d) = 3$;

• $H(r_a,r_e) = 3$.

In what follows, by an abuse of notation, we may use a group $g(\cdot)$ to denote its representative row
$r(g(\cdot))$. Fig. 2 shows the groups of $R_i$, $R_j$, $E_{i,j}$. Each group of identical rows is represented with a
vertex, while an edge joins two vertices iff the corresponding groups are at Hamming distance 1.




Figure 2: Groups at Hamming distance 1 in $R_i$, $R_j$, $E_{i,j}$: vertices represent groups, while an edge
joins two vertices representing groups at Hamming distance 1.

Proof of Lemma 11

Lemma 24. Let Ri be a set of rows, then an optimal solution of 3-AP(3) over instance Ri has a cost of 
at least 9. 

Proof. Let us consider the set of 9 rows, distributed in 6 groups, of $R_i$. As none of the groups of $R_i$
consists of at least 3 rows, it follows that any solution of 3-AP(3) suppresses at least one entry in each row
of $R_i$, hence the lemma follows. □

Proof of Lemma 12

In order to prove Lemma 12, we first have to show some properties of a canonical solution.

Lemma 25. Let Ri be a set of rows. Then a solution of 3-AP(3) over instance R induces an optimal cost 
for the set Ri if it is a type a solution. 

Proof. By construction, a type a solution is an optimal solution over instance Ri, as each row has a cost 
of 1 in a type a solution. □ 



Now, in Lemma 26 we will prove a property of a type b solution over the sets $R_i$, $E_{i,j}$, $E_{i,h}$, $E_{i,k}$.

Lemma 26. Let $S$ be a type b solution of 3-AP(3) over instance $R_i \cup E_{i,j} \cup E_{i,h} \cup E_{i,k}$; then $S$
suppresses 9 entries in the rows of $R_i$, and 1 entry for each row of $g^i(v_i,v_j)$, $g^i(v_i,v_h)$, $g^i(v_i,v_k)$.

Proof. By construction each set of a type b solution containing a docking group of $R_i$ consists of three
rows, where exactly one position is suppressed for each row by Prop. 23. The cluster
$g_4(v_i) \cup g_5(v_i) \cup g_6(v_i)$ consists of 3 rows, where exactly one position is suppressed for each row.
By a simple counting argument, the rows of $R_i$ have a total cost of 9. □

Let $\Pi$ be a solution of 3-AP(3) over instance $R$ and let $E_{i,j}$ be an edge set. Then we say that $\Pi$
induces an $i$-normal solution for $E_{i,j}$ if it contains the following three sets: (i) one set
$C_1 = g^i(v_i,v_j) \cup d_{i,j}(v_i)$, (ii) one set $C_2 \supseteq \bigcup_{t \in \{4,5,6\}} g_t(v_i,v_j)$, such that exactly
two entries (in columns 1 and 3) are suppressed in the rows of $C_2$, (iii) one set
$C_3 = g_2(v_i,v_j) \cup g^j(v_i,v_j)$.

Lemma 27. Let $\Pi$ be a solution of 3-AP(3) over instance $R$ and let $E_{i,j}$ be an edge set; then $\Pi$
suppresses at most 10 entries in the rows of $E_{i,j}$ only if it induces an $i$-normal or $j$-normal solution for
$E_{i,j}$.

Proof. First assume that $\Pi$ induces an $i$-normal solution for $E_{i,j}$. By Prop. 23 it follows that one entry
of $g^i(v_i,v_j)$ is suppressed. In the set consisting of the rows in $g_t(v_i,v_j)$, with $t \in \{4,5,6\}$, by
Prop. 23 two positions for each row are suppressed. Finally, by Prop. 23, in the set
$g_2(v_i,v_j) \cup g^j(v_i,v_j)$ exactly one position is suppressed for each row.

Now, let us prove that if $\Pi$ is a solution that is not $i$-normal or $j$-normal for $E_{i,j}$, then
$c_\Pi(E_{i,j}) \ge 11$. Notice that each row in $g_t(v_i,v_j)$, with $t \in \{4,5,6\}$, has Hamming distance at least 2
from any other row of $R \setminus g_t(v_i,v_j)$, hence at least two of its entries are suppressed in each solution
$\Pi$. Furthermore, notice that each of the four rows in the groups $g_1(v_i,v_j)$, $g_2(v_i,v_j)$, $g_3(v_i,v_j)$ must
have a cost of at most 1. But then, the rows of $g_2(v_i,v_j)$ must be co-clustered with the row of exactly one
of $g_1(v_i,v_j)$, $g_3(v_i,v_j)$ (w.l.o.g. $g_3(v_i,v_j)$). But then $g_1(v_i,v_j) = g^i(v_i,v_j)$ must be
co-clustered with $d_{i,j}(v_i)$. □

Lemma 28. Let $S$ be a solution of 3-AP(3) over instance $R$; then we can compute in polynomial time a
solution $S'$ such that $c(S') \le c(S)$ and $S'$ contains at most one set suppressing three entries for each row.

Proof. Assume that solution $S$ contains sets $Y_1, \ldots, Y_p$, with $p \ge 2$, such that all the positions of the
rows in $Y_j$ are suppressed. Then we can compute in polynomial time a solution $S'$ by merging the sets
$Y_1, \ldots, Y_p$ into a single cluster $Y$. Notice that $c(S') \le c(S)$, as in both solutions $S'$ and $S$ three
positions are suppressed for each row $r \in \bigcup_{j=1,\ldots,p} Y_j$. □
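
The merging step of this proof can be sketched as follows. This is an illustration under simple assumptions: a solution is a list of clusters, a cluster is a list of 3-column rows, and the cost of a cluster is the number of non-constant columns times the number of its rows. The function names and sample rows are hypothetical.

```python
def suppressed_columns(cluster, width=3):
    """Number of columns in which the rows of a cluster do not all agree."""
    return sum(len({row[c] for row in cluster}) > 1 for c in range(width))

def cost(solution):
    """Total number of suppressed entries of a clustering."""
    return sum(len(cluster) * suppressed_columns(cluster) for cluster in solution)

def merge_fully_suppressed(solution, width=3):
    """Merge all clusters whose rows are suppressed in every column."""
    full = [c for c in solution if suppressed_columns(c, width) == width]
    rest = [c for c in solution if suppressed_columns(c, width) < width]
    if len(full) <= 1:
        return solution
    merged = [row for cluster in full for row in cluster]
    return rest + [merged]

solution = [
    [("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")],  # fully suppressed
    [("j", "k", "l"), ("m", "n", "o"), ("p", "q", "r")],  # fully suppressed
    [("z", "z", "z")] * 3,                                # no suppression
]
merged = merge_fully_suppressed(solution)
print(cost(solution), cost(merged), len(merged))  # 18 18 2
```

Merging cannot change the cost, because every row of a fully suppressed cluster already pays 3 and still pays 3 in the merged cluster.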

Now, let us first introduce some properties of a solution of 3-AP(3) over instance R. 

Lemma 29. Let $\Pi$ be a solution of 3-AP(3) over instance $R$; we can compute in polynomial time a
solution $\Pi'$ such that

1. for each edge set $E_{i,j}$, $\Pi'$ has a set $S_{i,j}$ containing the rows of groups
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$, such that for each row in $S_{i,j}$ exactly two columns (columns 1 and 3) are
suppressed;

2. $c(\Pi') \le c(\Pi)$.

Proof. First, notice that by Prop. 23 the rows in the groups $\bigcup_{t=4,5,6} g_t(v_i,v_j)$ have distance smaller
than 3 only w.r.t. rows of $E_{i,j}$. Furthermore, notice that, by construction, each row in
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$ may be equal to another row of $R$ only in the second position.

Assume that there exist clusters $S_1$, $S_2$, $S_3$ (at most one of these clusters can be empty) containing
the rows of $\bigcup_{t=4,5,6} g_t(v_i,v_j)$, such that at most two entries are suppressed for each row of the cluster
$S_j$, $j \in \{1,2,3\}$. Then, for each row in $S_1 \cup S_2 \cup S_3$, the positions 1 and 3 are suppressed. Hence,
we can merge clusters $S_1$, $S_2$, $S_3$ without increasing the cost of the solution, obtaining one set that
contains the rows $\bigcup_{t=4,5,6} g_t(v_i,v_j)$.

Assume that some rows of $\bigcup_{t=4,5,6} g_t(v_i,v_j)$ are in the cluster $X$ and some rows of
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$ are in a different cluster $Y$, such that at most two entries are suppressed for each
row of the cluster $Y$. Then we can move the rows of $(\bigcup_{t=4,5,6} g_t(v_i,v_j)) \cap X$ to $Y$, decreasing the
cost of the solution.

Now, assume that these rows are all clustered in set $X$. It follows that each row of
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$ has a cost of 3. Hence, we can move this set of rows to a new set
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$, decreasing the cost of the solution. □

Lemma 30. Let $R_i$, $R_j$ be two sets of rows and let $E_{i,j}$ be an edge set of $R$. Let $\Pi$ be a solution of
3-AP(3) over instance $R$ that associates a type b solution with both $R_i$, $R_j$. Then we can compute in
polynomial time a solution $\Pi'$ of 3-AP(3) over instance $R$ where exactly one of $R_i$, $R_j$ is associated with
a type b solution and such that $c(\Pi') \le c(\Pi)$.

Proof. Notice that, by Prop. 23, the rows of group $g_2(v_i,v_j)$ have Hamming distance 1 only from the rows
of $g_1(v_i,v_j)$ and $g_3(v_i,v_j)$. Since in a type b solution the rows of $g_1(v_i,v_j)$ and $g_3(v_i,v_j)$ are
co-clustered with rows of $R_i$ and $R_j$, it follows by Prop. 23 that the rows of $g_2(v_i,v_j)$ are co-clustered in
$\Pi$ with rows at Hamming distance at least 2. Hence, $\Pi$ suppresses at least two entries in each row of
$g_2(v_i,v_j)$, and, as $g_2(v_i,v_j)$ consists of 2 rows, at least 4 entries of rows in $g_2(v_i,v_j)$ are
suppressed in $\Pi$. Notice that, as $R_i$ and $R_j$ are associated with type b solutions in $\Pi$, the only rows
that can be clustered with the rows of $g_2(v_i,v_j)$ are those of groups $g_4(v_i,v_j)$, $g_5(v_i,v_j)$,
$g_6(v_i,v_j)$.

Starting from solution $\Pi$, let us compute a solution $\Pi'$ of 3-AP(3) over instance $R$ as follows. Let
$E_{i,j}$, $E_{i,t}$, $E_{i,z}$ be the edge sets associated with the three edges incident in $v_i$. Modify solution $\Pi$
so that $\Pi'$ induces a type a solution for $R_i$, and a $j$-normal solution for $E_{i,j}$. Moreover, for each group
$g^i(v_i,v_z)$ of an edge set $E_{i,z}$, with $z \ne j$, co-cluster such group $g^i(v_i,v_z)$ with the cluster
containing the rows of $\bigcup_{t=4,5,6} g_t(v_i,v_z)$.

By Lemma 27 the rows of edge set $E_{i,j}$ have a total cost of 10. Notice that by Lemma 29 we can
assume that $\Pi$ has a set $C$ containing $\bigcup_{t=4,5,6} g_t(v_i,v_z)$, such that exactly two entries
(corresponding to the positions 1 and 3) are suppressed for each row in $C$. Hence the representative row of
$C$ has Hamming distance 2 from $g^i(v_i,v_z)$, as they are equal in position 2.

Now, each of these two rows $g^i(v_i,v_z)$ has a cost of 2 in $\Pi'$, while it has a cost of at least 1 in $\Pi$.
Notice that each of the two rows of $g_2(v_i,v_j)$ has a cost of at least 2 in $\Pi$, while it has a cost of 1 in
$\Pi'$. Hence $c(\Pi') \le c(\Pi)$. □

Lemma 31. Let $\Pi$ be a solution of 3-AP(3) over instance $R$, such that the two sets $R_i$, $R_j$ are not
associated with a type b solution in $\Pi$ and $\Pi$ induces a total cost of 10 for the rows in $E_{i,j}$. Then at
least one of $R_i$, $R_j$ has cost at least 11.

Proof. Assume that $\Pi$ induces a total cost of 10 for the rows in $E_{i,j}$. Notice that the rows in
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$ have a total cost of 6, as by Prop. 23 they are at Hamming distance at least 2 from
any other row of $R$. Furthermore, the 4 rows of $E_{i,j} \setminus \bigcup_{t=4,5,6} g_t(v_i,v_j)$ must have a cost of at
least 1 in $\Pi$. Notice that, by Lemma 27, $\Pi$ induces either an $i$-normal or a $j$-normal solution for $E_{i,j}$
(w.l.o.g. we assume that it is $i$-normal). Hence $g^i(v_i,v_j)$ is clustered with the rows of $d_{i,j}(v_i)$, while
$g^j(v_i,v_j)$ is clustered with $g_2(v_i,v_j)$, otherwise some rows of $g_2(v_i,v_j)$ are clustered in $\Pi$ with a
row at Hamming distance at least 2, hence the total cost of the rows in $E_{i,j}$ is greater than 10. But then, we
claim that $\Pi$ induces a cost of at least 11 for the rows of the set $R_i$.

Now, recall that by hypothesis $R_i$ is not associated with a type b solution in $\Pi$, and let us consider the
clusters containing rows of $R_i$ in $\Pi$. Notice that if at least two rows of $R_i$ are clustered with some rows
at Hamming distance at least 2, then $\Pi$ induces a cost of at least 11 for the set $R_i$. Recall that group
$d_{i,j}(g(v_i))$ of $R_i$ is clustered only with rows of group $g^i(v_i,v_j)$ of $E_{i,j}$, and consider the cases that
either the three groups of rows $g_4(v_i)$, $g_5(v_i)$, $g_6(v_i)$ are co-clustered, or not. In the former case, as
$R_i$ is not associated with a type b solution, it follows that the rows of at least one of the docking groups of
$R_i$ are clustered with rows at Hamming distance 2; hence $\Pi$ induces a cost of at least 11 for the set $R_i$.
In the latter case, let us consider the group of $R_i$ (w.l.o.g. $g_4(v_i)$) adjacent to $d_{i,j}(g(v_i))$ and let $C$
be the cluster containing the unique row of $g_4(v_i)$. As the rows in $g_4(v_i)$, $g_5(v_i)$, $g_6(v_i)$ are not
co-clustered, it follows that $C$ contains a row $r$ at Hamming distance at least 2 from $g_4(v_i)$. If $r \in R_i$,
then $\Pi$ suppresses at least two entries of two rows of $R_i$, namely $r$ and $g_4(v_i)$, hence $\Pi$ induces a
cost of at least 11 in the rows of $R_i$. If $r \notin R_i$, then $r$ must be a row at Hamming distance 3 from
$g_4(v_i)$. Indeed, by Prop. 22 and by Prop. 23, the rows at Hamming distance not greater than 2 from
$g_4(v_i)$ belong to $R_i \cup (E_{i,j} \setminus \bigcup_{t=4,5,6} g_t(v_i,v_j))$. We have assumed that
$(R_i \setminus \{g_4(v_i)\}) \cap C = \emptyset$, and it must be
$(E_{i,j} \setminus \bigcup_{t=4,5,6} g_t(v_i,v_j)) \cap C = \emptyset$, since by hypothesis $\Pi$ induces an $i$-normal
solution for the rows in $E_{i,j}$. Hence $g_4(v_i)$ must have cost equal to 3 and must be part of the cluster $X$
in $\Pi$ by Lemma 28. It follows that $\Pi$ induces a cost of at least 11 for the set $R_i$. □

Now, let us prove Lemma [12] 

Lemma 32. Let $\Pi$ be a solution of 3-AP(3) over instance $R$. Then we can compute in polynomial time
a canonical solution $\Pi'$ of 3-AP(3) over instance $R$ such that $c(\Pi') \le c(\Pi)$.

Proof. Let us consider the solution $\Pi$. Before computing a canonical solution $\Pi'$, we compute an
intermediate solution $\Pi''$ such that $c(\Pi'') \le c(\Pi)$ as follows. First, for each set $R_i$, if $R_i$ is
associated with a type b solution in $\Pi$, then define a type b solution for $R_i$ in $\Pi''$. Otherwise, if each
docking group of a set $R_i$ is clustered in $\Pi$ with the row of group $g^i(v_i,v_j)$ of $E_{i,j}$, then define a
type b solution for $R_i$ in $\Pi''$; else define a type a solution for $R_i$ in $\Pi''$. Furthermore, define a set
containing rows $x_1, x_2, x_3$ in $\Pi''$. Next, consider the rows of an edge set $E_{i,j}$, and define a clustering
of the rows not yet clustered in $\Pi''$. If exactly one of $R_i$, $R_j$ (w.l.o.g. $R_i$) is associated with a type b
solution in $\Pi''$, then define an $i$-normal solution for $E_{i,j}$ in $\Pi''$. Else if at least one of $R_i$, $R_j$
(w.l.o.g. $R_i$) is associated with a type a solution in $\Pi''$, then define the following solution: one set
contains the rows in $g^i(v_i,v_j) \cup g_2(v_i,v_j)$; one set contains the rows in
$\bigcup_{t=4,5,6} g_t(v_i,v_j) \cup g^j(v_i,v_j)$. If both $R_i$, $R_j$ are associated with a type b solution in
$\Pi''$, then define a set $\bigcup_{t=4,5,6} g_t(v_i,v_j) \cup g_2(v_i,v_j)$ in $\Pi''$.

Now, let us show that $c(\Pi) \ge c(\Pi'')$. By Lemma 25 and by Lemma 26 it follows that for each row
in a set $R_i$ the cost in $\Pi''$ is optimal. Furthermore, by Lemma 28 we can assume that $\Pi$ contains a set
$X \supseteq \{x_1, x_2, x_3\}$, hence the rows in $\{x_1, x_2, x_3\}$ all have cost 3 in both $\Pi$ and $\Pi''$. Hence it
remains to consider the cost of the edge set $E_{i,j}$.



Let $E_{i,j}$ be an edge set. Notice that by Lemma 29 we can assume that $\Pi$ contains a set
$S_{i,j} \supseteq \bigcup_{t=4,5,6} g_t(v_i,v_j)$, and by construction $\Pi''$ contains a set
$S''_{i,j} \supseteq \bigcup_{t=4,5,6} g_t(v_i,v_j)$. Hence each row of $g_t(v_i,v_j)$, with $t \in \{4,5,6\}$, has a
cost equal to 2 in both $\Pi$ and $\Pi''$. Let us consider the case when both sets $R_i$ and $R_j$ are associated
with a type b solution in both $\Pi$ and $\Pi''$. The groups of $E_{i,j}$ not co-clustered in a type b solution of
$R_i$, $R_j$ are $g_2(v_i,v_j)$, $g_4(v_i,v_j)$, $g_5(v_i,v_j)$, $g_6(v_i,v_j)$. By construction, as the rows at
Hamming distance 1 from $g_2(v_i,v_j)$ are clustered in the type b solutions of $R_i$, $R_j$ in $\Pi$ (hence cannot
be co-clustered with $g_2(v_i,v_j)$), it follows that the rows of $g_2(v_i,v_j)$ must be clustered with a row
having Hamming distance at least 2 in $\Pi$. As $\Pi''$ contains the set
$S''_{i,j} = (\bigcup_{t=4,5,6} g_t(v_i,v_j)) \cup g_2(v_i,v_j)$ and as $\Pi$ contains the set
$S_{i,j} \supseteq \bigcup_{t=4,5,6} g_t(v_i,v_j)$, it follows that the cost of the rows in $E_{i,j}$ in solution $\Pi$ is
greater than or equal to the cost of the rows in $E_{i,j}$ in solution $\Pi''$.

Let us consider the case when exactly one of the sets $R_i$ and $R_j$ (w.l.o.g. $R_j$) is associated with a type
b solution in $\Pi''$ and in $\Pi$. By construction, $R_i$ is associated with a type a solution in $\Pi''$. By
Lemma 27 it follows that $c_{\Pi''}(E_{i,j}) = 10$, and, as each row in $\bigcup_{t=4,5,6} g_t(v_i,v_j)$ has a cost of 2
in $\Pi''$, it follows that each row in $E_{i,j} \setminus (\bigcup_{t=4,5,6} g_t(v_i,v_j))$ has a cost of 1 in $\Pi''$. As
$\Pi$ contains the set $S_{i,j}$, it follows that $\Pi$ suppresses two entries in the rows of
$\bigcup_{t=4,5,6} g_t(v_i,v_j)$, hence the cost of the rows in $E_{i,j}$ in solution $\Pi$ is greater than or equal to
the cost of the rows in $E_{i,j}$ in solution $\Pi''$.

Let us consider the case when at least one of the sets $R_i$ and $R_j$ (w.l.o.g. $R_i$) is associated with
a type b solution in $\Pi''$ and not in $\Pi$. Notice that by construction, the rows in groups
$d_{i,j}(v_i)$, $g_1(v_i, v_j)$ are co-clustered in both $\Pi$ and $\Pi''$. Now, if $\Pi$ induces a cost of
at least 11 for the rows in $E_{i,j}$, since $\Pi''$ induces a cost of at most 11 for the rows in $E_{i,j}$
it follows that $c_{\Pi}(E_{i,j}) \geq c_{\Pi''}(E_{i,j})$. If $\Pi$ induces a cost of 10 for the rows in
$E_{i,j}$, then by Prop. 23 $g_2(v_i, v_j)$ must be co-clustered with $g_3(v_i, v_j)$. Then it follows that,
by construction, $R_j$ is associated with a type a solution in $\Pi''$ and that the rows of $E_{i,j}$ have a
total cost of 10 in $\Pi''$. Hence $c_{\Pi}(E_{i,j}) \geq c_{\Pi''}(E_{i,j})$.

Now, let us consider the case when both $R_i$, $R_j$ are associated with a type a solution in $\Pi''$. In
this case, by construction, the rows in the edge set $E_{i,j}$ have a total cost of 11 in $\Pi''$, while
they have a cost of at least 10 in $\Pi$, as the rows in $\bigcup_{t=4,5,6} g_t(v_i, v_j)$ (contained in the
set $S_{i,j}$ of $\Pi$) have a total cost of 6, while each of the 4 rows of
$E_{i,j} \setminus \bigcup_{t=4,5,6} g_t(v_i, v_j)$ has a cost of at least 1 in $\Pi$. Assume that the rows
of $E_{i,j}$ have a total cost of 10 in $\Pi$. By Lemma 31, $\Pi$ induces a total cost of 11 for the rows of
one of the sets $R_i$, $R_j$ (w.l.o.g. $R_i$). Notice that by Lemma 25 $\Pi''$ induces a cost of 9 for the
whole set $R_i$. Now, let us consider the set $R_i$ and the three edge sets $E_{i,j}$, $E_{i,h}$, $E_{i,k}$.
In what follows, we will consider the cost induced by $\Pi$ and by $\Pi''$ in the set $R_i$ and in some of
the edge sets $E_{i,j}$, $E_{i,h}$, $E_{i,k}$. More precisely, for each edge set $E_{i,x}$ in
$\{E_{i,j}, E_{i,h}, E_{i,k}\}$, let us consider its cost together with the cost of $R_i$ only if
$d_{i,x}(v_i)$ and $g_1(v_i, v_x)$ are co-clustered in $\Pi$ (otherwise $E_{i,x}$ will eventually be
considered together with $R_x$). By construction the cost of at most two edge sets in $E_{i,j}$, $E_{i,h}$,
$E_{i,k}$ (assume w.l.o.g. $E_{i,j}$, $E_{i,h}$) is considered together with the cost of $R_i$, otherwise
$d_{i,x}(v_i)$ and $g_1(v_i, v_x)$ would be co-clustered in $\Pi$ for each $x \in \{j, h, k\}$ and by
construction $R_i$ would be associated with a type b solution in $\Pi''$. Since
$c_{\Pi}(E_{i,j}) \geq 10$, $c_{\Pi}(E_{i,h}) \geq 10$, $c_{\Pi}(R_i) \geq 11$, while
$c_{\Pi''}(E_{i,j}) = 11$, $c_{\Pi''}(E_{i,h}) = 11$ and $c_{\Pi''}(R_i) = 9$, it follows that
$c_{\Pi}(E_{i,j}) + c_{\Pi}(E_{i,h}) + c_{\Pi}(R_i) \geq c_{\Pi''}(E_{i,j}) + c_{\Pi''}(E_{i,h}) + c_{\Pi''}(R_i)$.
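The last inequality can be checked by substituting the bounds just quoted. The display below is only a restatement of that arithmetic, with $c_{\Pi}$ and $c_{\Pi''}$ denoting the costs induced by the two solutions as in the text.

```latex
\begin{align*}
c_{\Pi}(E_{i,j}) + c_{\Pi}(E_{i,h}) + c_{\Pi}(R_i) &\geq 10 + 10 + 11 = 31,\\
c_{\Pi''}(E_{i,j}) + c_{\Pi''}(E_{i,h}) + c_{\Pi''}(R_i) &= 11 + 11 + 9 = 31.
\end{align*}
```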

Now, we have shown that $c(\Pi) \geq c(\Pi'')$. Notice that $\Pi''$ may not be a canonical solution, as
there may exist two sets $R_i$, $R_j$, with $E_{i,j}$ part of the instance, associated with a type b
solution in $\Pi''$. Now, applying Lemma 30 for each pair of sets $R_i$, $R_j$ associated with a type b
solution in $\Pi''$, with $E_{i,j}$ part of the instance, we can compute a canonical solution $\Pi'$ such
that $c(\Pi'') \geq c(\Pi')$. Hence $c(\Pi) \geq c(\Pi')$. □



Lemma 33. Let $C$ be a cover of $G$. Then, we can compute in polynomial time a solution $\Pi$ of 3-AP(3)
over instance $R$ of cost $6|V| + 3|C| + 11|E| + 9$.

Proof. We can define a solution $\Pi$ of 3-AP(3) of cost $6|V| + 3|C| + 11|E| + 9$ as follows. Define a
type a solution for each $R_i$ associated with a vertex $v_i \in C$. Each such set has a cost of 9.

Define a type b solution for each set $R_i$ associated with a vertex $v_i \notin C$, and define an i-normal
solution for the sets $E_{i,j}$, $E_{i,h}$, $E_{i,k}$. Each such set $R_i$ has a cost of 9, and each edge
set in $\{E_{i,j}, E_{i,h}, E_{i,k}\}$ has a cost of 10. Charging this decrease in the cost of the edge sets
(from 11 to 10) to the set $R_i$ is equivalent to assigning a type b solution a cost equal to 6.

For any other edge set $E_{i,j}$ add to $\Pi$ the following sets: $S_1 = g_1(v_i, v_j) \cup g_2(v_i, v_j)$,
$S_2 = g_3(v_i, v_j) \cup g_4(v_i, v_j) \cup g_5(v_i, v_j) \cup g_6(v_i, v_j)$. Each such edge set has a
cost of 11. Finally, define a set $X = \{x_1, x_2, x_3\}$, having a total cost of 9. □
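The accounting in this proof can be sanity-checked numerically. The following is a minimal sketch, not part of the original construction: it does not build the table $R$, it only adds up the per-set costs quoted above (9 for every $R_i$, 10 for an edge set incident to a vertex outside $C$, 11 for any other edge set, 9 for $X$) and compares the sum with the closed form. It assumes that $G$ is cubic, as the three edge sets per vertex suggest, and that $C$ is a vertex cover. The function names `tally_cost` and `closed_form` are hypothetical.

```python
from itertools import combinations


def tally_cost(vertices, edges, cover):
    """Add up the per-set costs quoted in the proof (assumed values, see lead-in)."""
    cover = set(cover)
    # Assumption: C is a vertex cover, so every edge has an endpoint in C.
    assert all(u in cover or v in cover for u, v in edges)
    total = 9  # the set X = {x_1, x_2, x_3}
    # Every R_i costs 9, whether it gets a type a or a type b solution.
    total += 9 * len(vertices)
    for u, v in edges:
        # An edge set incident to a type b vertex (one outside C) costs 10,
        # every other edge set costs 11.
        incident_to_type_b = u not in cover or v not in cover
        total += 10 if incident_to_type_b else 11
    return total


def closed_form(vertices, edges, cover):
    """The cost stated in Lemma 33: 6|V| + 3|C| + 11|E| + 9."""
    return 6 * len(vertices) + 3 * len(cover) + 11 * len(edges) + 9


# Example on K4, which is cubic; {0, 1, 2} is a vertex cover.
k4_vertices = [0, 1, 2, 3]
k4_edges = list(combinations(k4_vertices, 2))
assert tally_cost(k4_vertices, k4_edges, {0, 1, 2}) == closed_form(
    k4_vertices, k4_edges, {0, 1, 2}
)
```

The two totals agree because each vertex outside $C$ has exactly three incident edge sets, each one unit cheaper, and no edge has both endpoints outside a vertex cover, so no discount is counted twice.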